
Generate Prediction Ranking after LDA (Topic Modeling) Process

svtorykh Member Posts: 35 Guru
edited December 2018 in Help

Hi RM Community!

 

I'm running the LDA (Topic Modeling) process on my text data and generating 30 topics. How can I apply Generate Prediction Ranking after the LDA process so that my output contains 3-5 columns with the highest-confidence topics for each document (row in the table)?

 

Thanks!

 

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi,

    Since LDA provides confidences and predictions, there is no difference from classification problems here. That said, I can't tell you how to do this off the top of my head. Maybe @sgenzer knows?

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • svtorykh Member Posts: 35 Guru

    Thanks! I hope someone from the RM team can help with this. LDA generates one final topic prediction per document, based on the highest confidence value across all 30 topics. I need to be more flexible and generate additional ranked predicted-topic columns in my output, based on the 2nd, 3rd, ... highest confidence values. This could be done manually in Excel, but then what's the value of RM? :)

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    OK, challenge accepted :) Here's a classic sgenzer ETL hack job for you. It's not pretty, but the second-to-last operator (Filter Example Range) lets you select how many confidences you want.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="8.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="body" value="&quot;Hello. I am a university student. How do I download RapidMiner and get an academic license ?&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Subprocess (2)" width="90" x="179" y="34">
    <process expanded="true">
    <operator activated="true" class="nominal_to_text" compatibility="8.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="body"/>
    <description align="center" color="transparent" colored="false" width="126">body only</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.2.000" expanded="true" height="82" name="Replace (13)" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="body"/>
    <parameter key="replace_what" value="[&lt;]a href.*\s"/>
    <description align="center" color="transparent" colored="false" width="126">getting rid of aref links</description>
    </operator>
    <operator activated="true" class="trim" compatibility="8.2.000" expanded="true" height="82" name="Trim (2)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34">
    <list key="specify_weights">
    <parameter key="body" value="1.0"/>
    <parameter key="bodyOriginal" value="0.0"/>
    <parameter key="author.id" value="0.0"/>
    <parameter key="conversationId" value="0.0"/>
    <parameter key="author.type" value="0.0"/>
    <parameter key="messageId" value="0.0"/>
    </list>
    </operator>
    <connect from_port="in 1" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Replace (13)" to_port="example set input"/>
    <connect from_op="Replace (13)" from_port="example set output" to_op="Trim (2)" to_port="example set input"/>
    <connect from_op="Trim (2)" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">ETL</description>
    </operator>
    <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve LDA model" width="90" x="179" y="187">
    <parameter key="repository_entry" value="../Models/LDA model"/>
    </operator>
    <operator activated="true" class="operator_toolbox:apply_model_documents" compatibility="1.0.000" expanded="true" height="103" name="Apply Model (Documents)" width="90" x="380" y="34"/>
    <operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="514" y="34">
    <parameter key="attribute_name" value="prediction(Topic)"/>
    <list key="set_additional_roles">
    <parameter key="id" value="regular"/>
    <parameter key="text" value="regular"/>
    </list>
    </operator>
    <operator activated="true" class="transpose" compatibility="8.2.000" expanded="true" height="82" name="Transpose" width="90" x="648" y="34"/>
    <operator activated="true" class="sort" compatibility="8.2.000" expanded="true" height="82" name="Sort" width="90" x="782" y="34">
    <parameter key="attribute_name" value="att_1"/>
    <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range" width="90" x="916" y="34">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="6"/>
    <description align="center" color="transparent" colored="false" width="126">Choose a range from 1 to 4+</description>
    </operator>
    <operator activated="true" class="transpose" compatibility="8.2.000" expanded="true" height="82" name="Transpose (2)" width="90" x="1050" y="34"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Subprocess (2)" to_port="in 1"/>
    <connect from_op="Subprocess (2)" from_port="out 1" to_op="Apply Model (Documents)" to_port="doc"/>
    <connect from_op="Retrieve LDA model" from_port="output" to_op="Apply Model (Documents)" to_port="mod"/>
    <connect from_op="Apply Model (Documents)" from_port="exa" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
    <connect from_op="Transpose (2)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott
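
For readers who want to check the expected result outside RapidMiner, here is a minimal Python sketch of the same idea. As written, the process above sorts the transposed table by `att_1`, which appears to hold the first document's values, so this sketch ranks every document separately instead. The function names, the `rank_N` / `confidence_rank_N` column names, and the sample confidences are all made up for illustration and are not part of any RapidMiner API.

```python
from typing import Dict, List, Tuple


def top_k_topics(confidences: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (topic, confidence) pairs with the highest confidence."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def rank_documents(rows: List[Dict[str, float]], k: int) -> List[Dict[str, object]]:
    """Build one output row per document with rank_1..rank_k columns."""
    output = []
    for row in rows:
        out_row: Dict[str, object] = {}
        for rank, (topic, conf) in enumerate(top_k_topics(row, k), start=1):
            out_row[f"rank_{rank}"] = topic             # hypothetical column name
            out_row[f"confidence_rank_{rank}"] = conf   # hypothetical column name
        output.append(out_row)
    return output


# Made-up confidences for two documents and four topics (illustrative only).
docs = [
    {"Topic_0": 0.10, "Topic_1": 0.55, "Topic_2": 0.05, "Topic_3": 0.30},
    {"Topic_0": 0.40, "Topic_1": 0.15, "Topic_2": 0.35, "Topic_3": 0.10},
]
ranked = rank_documents(docs, k=3)
for row in ranked:
    print(row)
```

Changing `k` plays the same role as the range chosen in Filter Example Range.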

     

  • svtorykh Member Posts: 35 Guru

    Thanks for the effort guys! Will this work with 20K documents as well?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    I see no reason why not... ;)

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @sgenzer and @svtorykh,

     

    I just remembered that there is an operator for this. It's called Generate Prediction Ranking and should do the trick!

    Sorry for not remembering this first. I think I only used this operator once, 4 years ago.

     

    Best,

    Martin

     

  • svtorykh Member Posts: 35 Guru

    Actually, that was the first operator I tried, but for some reason it wouldn't work with the LDA confidences. Could you share a process that applies Generate Prediction Ranking after the LDA operator? I think some of the attributes must be changed, but I'm not sure how to do it.

  • svtorykh Member Posts: 35 Guru

    Hi Scott,

    In your process, at what point is att_1 created? I can't find it in the example set.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @svtorykh,

    you found a bug in LDA :/ (sort of).

    Usually, each confidence attribute is identified by its role, which is normally confidence_CLASSNAME. When I programmed the operator, I used Confidence_CLASSNAME (with a capital C) as the role for the probabilities, so Generate Prediction Ranking does not recognize them. You need to manually switch the roles of the confidence attributes (maybe with a loop). Attached is a process that demonstrates that it works afterward.

    I will fix this bug, but most likely not this week or next. Another feature needs to be merged first, and I am traveling for 3 days next week.

     

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="concurrency:loop" compatibility="8.1.001" expanded="true" height="82" name="Loop" width="90" x="45" y="34">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
    <parameter key="text" value="Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. &#10;&#10;Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. &#10;&#10;Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi. &#10;&#10;Nam liber tempor **** soluta nobis eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum. 
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat. &#10;&#10;Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis. &#10;&#10;At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, At accusam aliquyam diam diam dolore dolores duo eirmod eos erat, et nonumy sed tempor et et invidunt justo labore Stet clita ea et gubergren, kasd magna no rebum. sanctus sea sed takimata ut vero voluptua. est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur"/>
    </operator>
    <connect from_op="Create Document" from_port="output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Get a Collection of documents</description>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="8.1.001" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34">
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="246" y="34"/>
    <operator activated="false" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="380" y="136">
    <parameter key="min_chars" value="2"/>
    </operator>
    <connect from_port="single" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Text Prep using Text Mining</description>
    </operator>
    <operator activated="true" class="operator_toolbox:lda" compatibility="1.0.000" expanded="true" height="124" name="LDA" width="90" x="313" y="34">
    <parameter key="number_of_topics" value="2"/>
    <parameter key="iterations" value="100"/>
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1997"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
    <parameter key="attribute_name" value="confidence(Topic_0)"/>
    <parameter key="target_role" value="confidence_Topic_0"/>
    <list key="set_additional_roles">
    <parameter key="confidence(Topic_1)" value="confidence_Topic_1"/>
    </list>
    </operator>
    <operator activated="true" class="generate_prediction_ranking" compatibility="8.1.001" expanded="true" height="82" name="Generate Prediction Ranking" width="90" x="581" y="34">
    <parameter key="number_of_ranks" value="2"/>
    </operator>
    <connect from_op="Loop" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Loop Collection" from_port="output 1" to_op="LDA" to_port="col"/>
    <connect from_op="LDA" from_port="exa" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Generate Prediction Ranking" to_port="example set input"/>
    <connect from_op="Generate Prediction Ranking" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
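
The process above sets the roles for two topics by hand. With 30 topics, typing the Set Role entries gets tedious, so here is a minimal Python sketch that generates the `<parameter>` lines for the `set_additional_roles` list. It assumes the attribute names follow the `confidence(Topic_i)` pattern and the target roles follow the `confidence_Topic_i` pattern seen in the process above. The function name `set_role_parameters` is made up for illustration, and the output is meant to be pasted into the process XML by hand.

```python
import xml.etree.ElementTree as ET
from typing import List


def set_role_parameters(number_of_topics: int) -> List[str]:
    """Build <parameter> entries that map each LDA confidence column
    to a lower-case confidence role (naming pattern is an assumption)."""
    entries = []
    for i in range(number_of_topics):
        param = ET.Element(
            "parameter",
            {
                "key": f"confidence(Topic_{i})",    # attribute name, as in the process above
                "value": f"confidence_Topic_{i}",   # target role with a lower-case "c"
            },
        )
        entries.append(ET.tostring(param, encoding="unicode"))
    return entries


entries = set_role_parameters(30)
print("\n".join(entries))
```

In the process above, the first topic is set through `attribute_name` and `target_role`, and only the remaining topics go into the list, so the first generated line may need to be moved there.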
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    "Generate Prediction Ranking" - nope never seen that one before! You learn something every day! Thanks @mschmitz for rendering my messy ETL completely useless :)

     

    @svtorykh att_1 is generated automatically by the Transpose operator.

     

    Scott

     

     

  • svtorykh Member Posts: 35 Guru

    Thanks! Would you please fix the descriptions of Alpha and Beta Heuristics? I think the descriptions need to be switched between the two!

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    @svtorykh,

    sure will do! Thanks for reporting.


    FYI - there is a small bug in the current Marketplace version: Alpha heuristics is wrong by a factor of #topics. This will be fixed in the next version. I've also added a feature to control Mallet's auto-tuning of alpha/beta.

     

    Best,

    Martin
