Topic Modelling with LDA

Ilyas Member Posts: 12 Contributor II
edited July 2021 in Help
Hi Guys,
How do I set up a process with LDA to do topic modeling, please? I placed the LDA operator in the process window, but I don't know how to import Word files into it.

Best Answers

  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Solution Accepted
    Ilyas,

    Your basic setup would be a Loop Files operator (to grab all your documents) with a Read Document operator inside it.

    The regex (?i).*docx tells the operator to use only the files whose names end in docx, ignoring upper and lower case.


    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.9.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="9.9.002" expanded="true" height="82" name="Loop Folder With Files" width="90" x="179" y="85">
            <parameter key="filter_type" value="regex"/>
            <parameter key="filter_by_regex" value="(?i).*docx"/>
            <parameter key="recursive" value="false"/>
            <parameter key="enable_macros" value="false"/>
            <parameter key="macro_for_file_name" value="file_name"/>
            <parameter key="macro_for_file_type" value="file_type"/>
            <parameter key="macro_for_folder_name" value="folder_name"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="246" y="34">
                <parameter key="extract_text_only" value="true"/>
                <parameter key="use_file_extension_as_type" value="true"/>
                <parameter key="content_type" value="txt"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="operator_toolbox:lda" compatibility="2.11.000" expanded="true" height="124" name="Extract Topics from Documents (LDA)" width="90" x="380" y="85">
            <parameter key="number_of_topics" value="10"/>
            <parameter key="show_optimization_settings" value="false"/>
            <parameter key="use_alpha_heuristics" value="true"/>
            <parameter key="alpha_sum" value="0.1"/>
            <parameter key="use_beta_heuristics" value="true"/>
            <parameter key="beta" value="0.01"/>
            <parameter key="optimize_hyperparameters" value="true"/>
            <parameter key="optimize_interval_for_hyperparameters" value="10"/>
            <parameter key="iterations" value="1000"/>
            <parameter key="top_words_per_topic" value="5"/>
            <parameter key="stopword language" value="english"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="enable_logging" value="false"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="include_meta_data" value="true"/>
          </operator>
          <connect from_op="Loop Folder With Files" from_port="output 1" to_op="Extract Topics from Documents (LDA)" to_port="col"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
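
The effect of the (?i).*docx filter in the process above can be checked outside RapidMiner. The following is a minimal Python sketch, not the operator's actual implementation: it assumes the filter must match the whole file name, and the helper `filter_file_names` and the sample file names are hypothetical.

```python
import re

def filter_file_names(names, pattern):
    """Keep only the names that the pattern matches in full.

    Full-match semantics are an assumption about how Loop Files
    applies its filter_by_regex parameter.
    """
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]

# Hypothetical folder content: two Word files, one text file, one old .doc file.
names = ["interview1.docx", "Interview2.DOCX", "interview3.txt", "notes.doc"]

word_files = filter_file_names(names, r"(?i).*docx")  # the filter used in the process
text_files = filter_file_names(names, r"(?i).*txt")   # the same filter adapted to txt files
all_files = filter_file_names(names, r".*")           # a catch-all filter

print(word_files)
print(text_files)
print(all_files)
```

Under these assumptions, a folder that contains only txt files yields an empty list for the docx filter, which is the situation in which the loop has nothing to pass on to LDA.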


  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    Thank you, Marco. I tried, but couldn't get it to run (see below).

    I am trying to run topic modeling for interview transcripts. The original transcript files are 'Text Document'. Would it be easier to run LDA with that?



  • MarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    edited July 2021 Solution Accepted
    Ilyas,

    Just adjust the filter by regex string in the parameters, or remove it if the folder contains only the files you need.
    The error shown tells me that there are no files with the .docx (Word Document) extension in your folder. If your files are .txt (Text Files), just change docx to txt in the regex.

    If you need further help, please type an @ followed by my name and I'll receive an e-mail alert with the latest update on your post.

  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    @MarcoBarradas,
    Thank you again for the direction. I can't get the Loop Files operator to see txt files. Could you please help?

    In summary, I still cannot make the process run. I have 10 separate txt files (for the 10 interviews I conducted). I also have a single combined txt file for all the interviews. Which is best to use: the individual txt files or the single large file?


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Hi,
    can you just try
    .*

    as the regex? That takes every file, no matter what it's called.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    Hi Mschmitz,

    Thank you. I got the below error. I wonder if I am using the wrong operator. I took the 'Loop Files' operator and renamed it 'Loop Folder with File'. Would this cause an issue?


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Hi,
    no. You probably did not connect the output of your read operator inside Loop Files correctly. Could you post the process XML?
    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    Sorry, I don't know how to post process XML. Please see .rmp file attached.
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Hi @Ilyas ,
    rmp files are nothing other than XML files, so you posted the right thing. You can open the XML panel to make the export easier, but that's a minor thing.

    Attached is an updated process. You had no operator inside Loop Files that actually reads the files. I added Read Document for txt files.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
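
The point that the loop needs a reader inside it can be illustrated outside RapidMiner: selecting files only produces file handles, and the text has to be read in a separate step. This is a minimal Python sketch under stated assumptions, not the operators' implementation: the function `read_documents`, the file names and the sample sentences are hypothetical, and the files are assumed to be UTF-8 plain text.

```python
import re
import tempfile
from pathlib import Path

def read_documents(folder, pattern=r"(?i).*\.txt"):
    """Return {file name: text} for every file in folder whose name matches pattern."""
    compiled = re.compile(pattern)
    documents = {}
    for path in sorted(Path(folder).iterdir()):      # the "loop over files" step
        if path.is_file() and compiled.fullmatch(path.name):
            # The "read document" step: without it there is no text to model.
            documents[path.name] = path.read_text(encoding="utf-8")
    return documents

# Build a throwaway folder with two transcripts and one file that should be skipped.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "interview1.txt").write_text("we talked about digital tools", encoding="utf-8")
    Path(folder, "interview2.txt").write_text("I think training matters", encoding="utf-8")
    Path(folder, "notes.docx").write_text("not a plain text file", encoding="utf-8")
    documents = read_documents(folder)

print(sorted(documents))
```

Each entry of `documents` corresponds to one document in the collection that is handed to the topic model.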
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    I managed to import my interview transcripts to Local Repository. See top left below.


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    In this case you can just drag and drop them in and use an Append operator to merge them into one data set.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    @mschmitz Thank you Martin. 

    I am finally making progress.

    My first file in the Local Repository (All Transcripts Combined) includes everything. That is all I need to use. So can I do the below?


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Hi,
    you can remove the Loop Files operator in this case.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    Do you mean to remove all the other individual text files from the local repository?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    No, just the Loop Files operator in your process.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    But it does not like it. Please see below.


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    There are two different LDA operators. You want to use the other one, called Extract Topics from Data.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    Thank you, it worked!

    In the text attribute parameter (top right) I entered "Text" and it did not work. The value "16:07:44 Go." was already there, and it works with that. Is the process below okay now, please?


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Hi,
    if the column holding the text is called 16... Go, then yes.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    My next challenge is to understand the results. How do I name the 10 topics from the results, please?


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Hi,
    have a look at the 2nd output of the operator. It contains the top words associated with each topic.
    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Ilyas Member Posts: 12 Contributor II
    Solution Accepted
    Am I correct in saying that weight means the number of times a word appeared in the document? For example, think:180, digital:125, etc...


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Solution Accepted
    Yes,
    I am actually not sure whether it is the plain count or whether some weighting is applied, but this is how you can interpret it.
    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
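
One possible reading of the weight column can be sketched in Python. The thread does not confirm how the operator computes the weight, so this sketch rests on an assumption: that the weight of a word in a topic is the number of its occurrences assigned to that topic, as is common in sampling-based LDA implementations. The function `top_words_per_topic` and the sample assignments are hypothetical.

```python
from collections import Counter, defaultdict

def top_words_per_topic(assignments, top_n=5):
    """Count how often each word was assigned to each topic and keep the top_n words.

    assignments is a list of (word, topic) pairs, one pair per word occurrence.
    Treating the count as the weight is an assumption, not confirmed operator behaviour.
    """
    counts = defaultdict(Counter)
    for word, topic in assignments:
        counts[topic][word] += 1
    return {topic: counter.most_common(top_n) for topic, counter in counts.items()}

# Hypothetical assignments: "think" occurs four times in total,
# three times in topic 0 and once in topic 1.
assignments = [
    ("think", 0), ("think", 0), ("think", 0), ("digital", 0), ("digital", 0),
    ("think", 1), ("training", 1), ("training", 1),
]

top = top_words_per_topic(assignments, top_n=2)
print(top[0])
print(top[1])
```

Under this assumption, the weight of a word in one topic can be smaller than the number of times the word appears in the documents overall, because other occurrences of the same word may belong to other topics. To name a topic, read its highest-weighted words together and pick a label that summarises them.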