Text Mining: analyse PDFs with a dictionary which has categories

nsmith Member Posts: 5 Learner I
edited August 28 in Help

Hello,

I want to analyse 35 PDFs against a kind of dictionary. The output of the analysis should be an Excel file that shows how often each word of the dictionary appears in the PDFs. It may be important to know that the dictionary is not just a flat list of words; the words are classified into five categories. So the analysis should tell me how much is reported on the dictionary words and which category is reported on the most.
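
In Python-like pseudo-code, the analysis I have in mind looks roughly like this sketch (the folder name and the example terms and categories are just placeholders, and pypdf/pandas only stand in for whatever RapidMiner does internally):

    import re
    from pathlib import Path

    import pandas as pd
    from pypdf import PdfReader

    # placeholder dictionary: term -> category (mine has five categories)
    dictionary = {"digital": "Technology", "digital products": "Technology"}

    rows = []
    for pdf_path in Path("reports").glob("*.pdf"):      # placeholder folder with the 35 PDFs
        pages = PdfReader(pdf_path).pages
        text = " ".join((page.extract_text() or "") for page in pages)
        tokens = re.findall(r"[a-zäöüß]+", text.lower())
        for term, category in dictionary.items():
            parts = term.lower().split()
            # multi-word terms are matched as runs of consecutive tokens
            n = sum(1 for i in range(len(tokens) - len(parts) + 1)
                    if tokens[i:i + len(parts)] == parts)
            rows.append({"file": pdf_path.name, "term": term,
                         "category": category, "count": n})

    result = pd.DataFrame(rows)
    result.to_excel("term_counts.xlsx", index=False)    # per-term counts per PDF (needs openpyxl)
    print(result.groupby("category")["count"].sum())    # which category appears most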

I have already read lots of questions here and watched tutorials, but I could not find exactly what I need. Trial and error hasn't worked so far either. I hope someone can help me.

Many thanks in advance,

Nina

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,636 RM Data Scientist
    Hi,
    This really depends on the format of your PDFs. Did you try to just read one of them using the Read Document operator?

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • nsmith Member Posts: 5 Learner I
    edited August 28
    Yes, I read the PDFs with the Read Document operator, and that works. The problem is the dictionary. I'm not able to filter the PDFs with my dictionary (which consists of words in an Excel file) so that I can see how often each word appears in the PDFs. Furthermore, I don't know how I can take account of the categories in my dictionary, i.e. whether RapidMiner can recognise categories in a dictionary (for example if each category is written in its own tab of my Excel file) or whether I need an additional operator for that.
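
    For illustration, this is how I imagine the dictionary being read in if each category sits in its own tab of the Excel file (a Python sketch; the file name and the assumption that the words are in the first column of each tab are placeholders):

    import pandas as pd

    # read every tab of the dictionary workbook: {tab name: DataFrame}
    sheets = pd.read_excel("dictionary.xlsx", sheet_name=None)

    # flatten into one term -> category mapping, using the tab names as categories
    term_to_category = {
        str(term).lower(): tab
        for tab, frame in sheets.items()
        for term in frame.iloc[:, 0].dropna()   # words assumed in the first column
    }
    print(term_to_category)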

    This is how I tried to get my desired result:


    <?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.7.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="179" y="85">
            <parameter key="file" 
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_absolute" value="1"/>
            <parameter key="prune_above_absolute" value="999999"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="187">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="112" y="289"/>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.3.001" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="313" y="85">
                <parameter key="max_length" value="2"/>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="313" y="187">
                <parameter key="min_chars" value="2"/>
                <parameter key="max_chars" value="25"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="9.3.001" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="581" y="187">
                <parameter key="file" 
                <parameter key="case_sensitive" value="false"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

    Unfortunately it doesn't work.

    Thanks for your help,
    Nina

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    It sounds like you want to use a specific wordlist and then count the words based on that wordlist (whose words are further grouped into five categories). You should be able to feed your desired wordlist into the word list input port of the Process Documents operator. You can then use the WordList to Data operator on the resulting wordlist to turn it into a normal dataset, which you can summarize and combine with your grouping to do the category analysis.
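
    In plain Python terms, that post-processing step would look roughly like this (a sketch; the counts and column names are made up for illustration):

    import pandas as pd

    # stand-in for the output of WordList to Data (made-up numbers)
    wordlist = pd.DataFrame({"word": ["digital", "sustainability"],
                             "occurrences": [42, 17]})
    # stand-in for the five-category grouping
    categories = pd.DataFrame({"word": ["digital", "sustainability"],
                               "category": ["Technology", "Environment"]})

    merged = wordlist.merge(categories, on="word", how="left")
    print(merged.groupby("category")["occurrences"].sum())   # totals per category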

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nsmith Member Posts: 5 Learner I
    Thanks for your answer @Telcontar120!
    Yes, you're right. I have a word list with key words (which are categorized) and want to scan all my PDFs for these words, so I only want to see these words and their occurrences in the result view.
    I tried your proposal, but I couldn't put the wordlist into the input port and connect it with the Process Documents operator; an error occurred. Furthermore, I'm not sure where to add all the PDFs that should be analysed. Are both the wordlist and the PDFs set as inputs for the Process Documents operator?

    I hope my problem is not too confusing. Maybe it helps to have a look at the XML I posted before. 

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    @mschmitz is there a way to import a wordlist from an external file to be used as input for Process Documents? Or is there a relevant converter? Looking at the operator more closely, it seems to require a wordlist already in RapidMiner format, which normally can only be generated by another Process Documents operator. Of course it would be possible to work around this by putting the desired wordlist as text into one Process Documents operator merely to generate the wordlist that feeds another Process Documents operator, but this seems somewhat inefficient, and I am wondering if there is a more direct path.
    @nsmith see my comments above regarding the wordlist input. It may be that you need to generate your wordlist first. Regarding the PDFs, you can use Process Documents from Files and set its parameters to read your PDF files from your hard drive.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,636 RM Data Scientist
    I think there is no way to generate a word list directly. Keep in mind that the wordlist also contains normalization factors for TF-IDF etc. But I think we can just build the full occurrence matrix here and filter the attributes later for the ones we are interested in. Alternatively, you can just use the Filter Tokens Using ExampleSet operator inside Process Documents.
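
    As a sketch of that idea outside RapidMiner (scikit-learn here; the documents and the word list are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["we accelerate digital products", "our digital strategy"]  # placeholders
    dictionary = {"digital", "products"}                               # placeholder word list

    vectorizer = CountVectorizer()                 # full occurrence matrix first ...
    matrix = vectorizer.fit_transform(docs)
    vocab = vectorizer.get_feature_names_out()

    keep = [i for i, term in enumerate(vocab) if term in dictionary]
    print([vocab[i] for i in keep])                # ... then filter to the word list
    print(matrix[:, keep].toarray())               # counts per document and kept term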

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    @mschmitz thanks, yes, Filter Tokens Using ExampleSet should have the equivalent effect.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nsmith Member Posts: 5 Learner I

    @mschmitz @Telcontar120 thank you very much for your answers, it's nearly working now! :)

    Unfortunately there is still one problem with the "Filter Tokens Using ExampleSet" operator. I want to filter with my word list, which contains two kinds of entries:

    1. Single words (like "digital")
    2. Terms with two or more words (like "digital products")

    In general it works because I use the "Generate n-Grams" operator beforehand, so all the stand-alone words and terms I specified appear in the result list. The problem is that the operator also generates terms which I did not explicitly put in the word list. An example is "accelerating_digital". Even though this term is not in my word list, I want to have it in my result list because it contains the word "digital" (which is in my word list).

    Is there a way to solve this problem?
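
    What I'm effectively looking for is a containment filter, roughly like this (Python sketch; the terms are just examples):

    wordlist = {"digital", "digital_products"}   # example word-list entries

    def keep(token: str) -> bool:
        # keep a token if it, or any part of the n-gram, is in the word list
        return token in wordlist or any(p in wordlist for p in token.split("_"))

    tokens = ["accelerating_digital", "digital", "annual_report"]
    print([t for t in tokens if keep(t)])        # -> ['accelerating_digital', 'digital']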

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,530 Unicorn
    If you change the order of your operators, you should be able to resolve this. You may need to redo some work: filter the text using your word list first, then generate the resulting word vector, and use the Generate n-Grams operator to build the combinations after that.
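
    The reordered flow would behave roughly like this (Python sketch; the tokens and word list are examples):

    wordlist = {"digital", "products", "sustainability"}        # example word list

    tokens = ["accelerating", "digital", "products", "report"]  # after tokenizing
    kept = [t for t in tokens if t in wordlist]                 # filter first
    bigrams = ["_".join(p) for p in zip(kept, kept[1:])]        # then build n-grams
    print(kept + bigrams)   # -> ['digital', 'products', 'digital_products']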

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • nsmith Member Posts: 5 Learner I
    edited September 16
    Thank you so much for your fast answer @Telcontar120! I tried a few possibilities and rearranged the operators, but it doesn't really work. Either I get no results in the result list, or I get results but on checking them I see that not every word that is in both the word list and the text shows up in the result list.
    This is an example of a process I tried:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.7.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="loop_files" compatibility="9.7.002" expanded="true" height="103" name="Loop Files" width="90" x="112" y="34">
            <parameter key="directory" value="C:/Users/Nina Schmidt/Documents/Master/Masterarbeit/Geschäftsberichte/Konsumgüter und Handel/2019"/>
            <parameter key="filtered_string" value="file name (last part of the path)"/>
            <parameter key="file_name_macro" value="file_name_TEST"/>
            <parameter key="file_path_macro" value="file_path"/>
            <parameter key="parent_path_macro" value="parent_path"/>
            <parameter key="recursive" value="false"/>
            <parameter key="iterate_over_files" value="true"/>
            <parameter key="iterate_over_subdirs" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="45" y="34">
                <parameter key="extract_text_only" value="true"/>
                <parameter key="use_file_extension_as_type" value="true"/>
                <parameter key="content_type" value="pdf"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="246" y="34">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="TF-IDF"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="false"/>
                <parameter key="prune_method" value="absolute"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_absolute" value="1"/>
                <parameter key="prune_above_absolute" value="9999"/>
                <parameter key="prune_below_rank" value="5.0"/>
                <parameter key="prune_above_rank" value="5.0"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="45" y="85">
                    <parameter key="mode" value="non letters"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                  </operator>
                  <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="85">
                    <parameter key="transform_to" value="lower case"/>
                  </operator>
                  <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="85"/>
                  <operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="85">
                    <parameter key="min_chars" value="2"/>
                    <parameter key="max_chars" value="30"/>
                  </operator>
                  <operator activated="true" class="retrieve" compatibility="9.7.002" expanded="true" height="68" name="Retrieve wortliste final_final" width="90" x="313" y="238">
                    <parameter key="repository_entry" value="data/wortliste final_final"/>
                  </operator>
                  <operator activated="true" class="operator_toolbox:filter_tokens_using_exampleset" compatibility="2.6.000" expanded="true" height="82" name="Filter Tokens Using ExampleSet" width="90" x="648" y="187">
                    <parameter key="attribute" value="att1"/>
                    <parameter key="case_sensitive" value="false"/>
                    <parameter key="invert_filter" value="true"/>
                  </operator>
                  <connect from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
                  <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
                  <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
                  <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens Using ExampleSet" to_port="document"/>
                  <connect from_op="Retrieve wortliste final_final" from_port="output" to_op="Filter Tokens Using ExampleSet" to_port="example set"/>
                  <connect from_op="Filter Tokens Using ExampleSet" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="82" name="Process Documents (2)" width="90" x="380" y="85">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="TF-IDF"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="false"/>
                <parameter key="prune_method" value="none"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_rank" value="0.05"/>
                <parameter key="prune_above_rank" value="0.95"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <process expanded="true">
                  <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.3.001" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="313" y="136">
                    <parameter key="max_length" value="2"/>
                  </operator>
                  <connect from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
                  <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="text:wordlist_to_data" compatibility="9.3.001" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="85"/>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
              <connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
              <connect from_op="Process Documents (2)" from_port="word list" to_op="WordList to Data" to_port="word list"/>
              <connect from_op="WordList to Data" from_port="word list" to_port="out 1"/>
              <connect from_op="WordList to Data" from_port="example set" to_port="out 2"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
              <portSpacing port="sink_out 3" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="9.7.002" expanded="true" height="82" name="Append" width="90" x="313" y="85">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <operator activated="true" class="write_excel" compatibility="9.7.002" expanded="true" height="103" name="Write Excel (2)" width="90" x="447" y="136">
            <parameter key="file_format" value="xlsx"/>
            <enumeration key="sheet_names"/>
            <parameter key="sheet_name" value="RapidMiner Data"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="number_format" value="#.0"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="write_file" compatibility="9.7.002" expanded="true" height="68" name="Write File" width="90" x="648" y="187">
            <parameter key="resource_type" value="file"/>
            <parameter key="filename" value="C:/Users/Nina Schmidt/Documents/Master/Masterarbeit/Analyse Geschäftsberichte/Konsumgüter und Handel/Konsumgüter und Handel 2019.xlsx"/>
            <parameter key="mime_type" value="application/octet-stream"/>
          </operator>
          <connect from_op="Loop Files" from_port="out 2" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Write Excel (2)" to_port="input"/>
          <connect from_op="Write Excel (2)" from_port="file" to_op="Write File" to_port="file"/>
          <connect from_op="Write Excel (2)" from_port="through" to_port="result 1"/>
          <connect from_op="Write File" from_port="file" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>



    I also tried placing the "Generate n-Grams" operator at the end of the same "Process Documents" operator that contains the "Filter Tokens" operator. Nothing has really worked so far.