"Import data PDF documents"

nina_ploetzl · November 2018

Hi there!
I'm completely new to RapidMiner and can't manage to import PDF files into the repository.
It says that it's an unknown file type. I'm sorry for the completely (!) basic question, but I can't find anything about that in the getting started training.

Thank you very much for your help!

MarcoBarradas · November 2018

Hi, what do you want to do with the PDFs? I guess you are going to do some text mining with them, or you will try to extract some table data from them.

You will need to install the text mining extension. To do so, open Extensions->Marketplace and search for the extension.

If you need to extract tables, also install the PDF table extraction extension.

After that you need to build a process that suits your needs.

I'm posting an example:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Open a PDF" width="90" x="112" y="34"/>
      <operator activated="true" class="pdf_table_extraction:pdf2exampleset_operator" compatibility="0.2.001" expanded="true" height="68" name="Get tables from a PDF" width="90" x="112" y="136"/>
      <operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="68" name="Use this one for more than One PDF" width="90" x="246" y="34">
        <parameter key="filter_type" value="regex"/>
        <parameter key="filter_by_regex" value="(?i).*pdf"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Open a PDF (2)" width="90" x="246" y="34"/>
          <connect from_port="file object" to_op="Open a PDF (2)" to_port="file"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

To paste the XML code I just posted, click on View->Show Panel->XML. You'll see a new view called XML. Remove the code that is there and paste the one I gave you, click on the green check mark, and then return to the process view.

For further steps you can follow these videos:

https://www.youtube.com/watch?v=ophGqpUexKI&list=PLssWC2d9JhOZLbQNZ80uOxLypglgWqbJA

MartinLiebig · November 2018

Hi @nina_ploetzl,

to add to @MarcoBarradas' fantastic comment: Read Document has an option to read the text of a PDF. It's also part of the text mining extension.

BR,

Martin

nina_ploetzl · November 2018

Thank you very much @mschmitz, that's what I was looking for as a first step to import the PDFs.
But I can only import them one by one. Is there any way to import hundreds at once?

nina_ploetzl · November 2018

Thank you very much @MarcoBarradas for the extensive reply! I will try that, but I'll need some time. Like I said, I'm completely new to RapidMiner and I have no experience with data mining or anything related.

I want to analyze a set of 1700 research articles and classify them by the research method they use. So I want to import these PDFs, look for specific words in them, and, if they contain these words, categorize them into groups...
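
The keyword idea described in this post can be sketched in a few lines of Python, outside RapidMiner. This is only an illustration of the logic: the category names and keywords below are hypothetical, not taken from the thread, and a document may fall into several groups or none.

```python
import re

# Hypothetical categories and keywords, for illustration only.
KEYWORDS = {
    "survey": {"questionnaire", "survey", "respondents"},
    "experiment": {"experiment", "treatment", "control"},
}

def categorize(text):
    """Return the sorted categories whose keywords occur in the text."""
    # Lower-case the text and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, keys in KEYWORDS.items() if words & keys)

sample = "We mailed a questionnaire to 200 respondents."
print(categorize(sample))
```

Exact word matching like this misses inflected forms (for example "surveys"), which is one reason the replies below suggest stemming before counting words.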

MartinLiebig · November 2018

Hi @nina_ploetzl,

you can use Loop Files to iterate over folders and pass the full path to Read Document.

For the categorisation: Have a look at Extract Topics from Documents. It's an operator which is part of the Operator Toolbox extension.

Best,

Martin

nina_ploetzl · November 2018

Hello @mschmitz!

Thank you so much for helping me.
I tried this version now, but somehow it doesn't work when I press run, and the tutorials on YouTube use older versions that look different.

It says "not enough iterations" or that there is no output from the Loop Files operator...see the attached screenshot...

Best, Nina

Bildschirmfoto 2018-11-05 um 15.35.20.png

MartinLiebig · November 2018

Hi @nina_ploetzl,
you need to put the Read Document operator inside the Loop Files operator.
Further, you want to use a regex filter with .+ on the folder to catch all documents.

cheers,
Martin

nina_ploetzl · November 2018

Hello @mschmitz!

I converted my documents into text files now. And I used the Loop Files operator and managed to get an example set with all my text files - where every example is one txt document.

This is my XML code for this process which is working:

<?xml version="1.0" encoding="UTF-8"?>
<process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files" width="90" x="112" y="34">
        <parameter key="directory" value="/Users/ninaploetzl/Downloads/pdftotext"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"/>
          <operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="514" y="34">
            <parameter key="text_attribute" value="mytext"/>
            <parameter key="add_meta_information" value="false"/>
          </operator>
          <connect from_port="file object" to_op="Read Document" to_port="file"/>
          <connect from_op="Read Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="9.0.003" expanded="true" height="82" name="Append" width="90" x="380" y="34"/>
      <connect from_op="Loop Files" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

But now, when I use the operators Tokenize, Stem and Filter Stopwords in the inner process of Loop Files, it doesn't work anymore. The problem is at the "Documents to Data" operator. If I put a breakpoint before it, the process works and tokenizes, stems and filters all the examples correctly. But if I put a breakpoint after this operator, it just gives me the whole unreduced text. And if I look at the overall outcome of the process, it gives me only 1 example, not the 20 which I imported.

This is my process XML code with the operators tokenize, stem and filter stop words:

[Process XML (version 9.0.003) not preserved in this copy of the thread.]

Please help!!! Thank you so much!

nina_ploetzl · November 2018

These are a few examples if you wanna try...

MartinLiebig · November 2018

Hi @nina_ploetzl ,
I think you pointed me to an issue with our dataframe. Can you check if the attached process does what you want to do?
BR,
Martin

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files" width="90" x="112" y="34">
<parameter key="directory" value="C:\Users\MartinSchmitz\Downloads\nina_txt"/>
<parameter key="filter_type" value="regex"/>
<parameter key="filter_by_regex" value=".+txt"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34"/>
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
<parameter key="characters" value="&quot; &quot;"/>
</operator>
<operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="380" y="136"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="136"/>
<operator activated="true" class="execute_script" compatibility="9.0.003" expanded="true" height="82" name="Execute Script" width="90" x="581" y="34">
<parameter key="script" value="import com.rapidminer.operator.text.*;
Document inputData = input[0];
StringBuilder buffer = new StringBuilder();

Iterator var2 = inputData.getTokenSequence().iterator();

while(var2.hasNext()) {
Token token = (Token)var2.next();
buffer.append(token.getToken());
buffer.append(&quot; &quot;);
}

return new Document(buffer.toString());"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="849" y="34">
<parameter key="text_attribute" value="mytext"/>
</operator>
<connect from_port="file object" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Execute Script" to_port="input 1"/>
<connect from_op="Execute Script" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<connect from_op="Loop Files" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

nina_ploetzl · November 2018

@mschmitz the code <?xml version="1.0" encoding="UTF-8"?> doesn't do anything if I copy it into my XML box?

MartinLiebig · November 2018

Sorry, my fault. It's attached

MarcoBarradas · November 2018

Hi Nina, sorry for the delay on this post. Now that I know a little more about what you need to do, I created a new process that may help you with what you are trying to achieve.
I took Martin's XML and tweaked it a little.

The process I attached reads all the txt files from the directory you set at "text directories" on the Process Documents from Files operator.
In the inner process I pasted your Tokenize, Stem and Filter Stopwords operators.
The word list output is connected to WordList to Data; with this you will know how often a word is used and in how many documents it appears.

The other part extracts the text we got from each file (Select Attributes) and converts each example to a document, which is then connected to Extract Topics from Documents.

For the next step we will need Martin's help, since I don't know the extension that well, but we are closer to the part where we will create a model that classifies each document.

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="85">
        <list key="text_directories">
          <parameter key="Files" value="C:\Users\mbarradas\Downloads\Files"/>
        </list>
        <parameter key="file_pattern" value="*.txt"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34">
            <parameter key="characters" value="&quot; &quot;"/>
          </operator>
          <operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="514" y="34"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Opens Files from the Directory you set at text directories</description>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="8.1.000" expanded="true" height="82" name="WordList to Data" width="90" x="246" y="136"/>
      <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="380" y="34">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="operator_toolbox:lda" compatibility="1.5.000" expanded="true" height="124" name="Extract Topics from Document (LDA)" width="90" x="648" y="136"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_port="result 1"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Select Attributes" from_port="original" to_port="result 2"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Extract Topics from Document (LDA)" to_port="col"/>
      <connect from_op="Extract Topics from Document (LDA)" from_port="exa" to_port="result 3"/>
      <connect from_op="Extract Topics from Document (LDA)" from_port="top" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>
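
The word list described in the post above (how often a word is used and in how many documents it appears) can be illustrated with a short Python sketch, outside RapidMiner. The stop-word list and the sample texts are tiny hypothetical stand-ins, tokens are split on whitespace only, and stemming is left out because the Python standard library has no Snowball stemmer.

```python
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "is"}

def word_list(texts):
    """Map each word to (total occurrences, number of documents containing it)."""
    total = Counter()
    in_docs = Counter()
    for text in texts:
        # Tokenize on whitespace and drop stop words.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        total.update(tokens)
        in_docs.update(set(tokens))  # count each word once per document
    return {word: (total[word], in_docs[word]) for word in total}

texts = ["the audit of the audit", "a fee and the audit"]
wl = word_list(texts)
print(wl)
```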

nina_ploetzl · November 2018

Thank you @mschmitz! But when I run this process it says "not enough iterations"? And do you have any idea/suggestion how I can cluster my example set by similarity? I did some research and thought of k-means and something with the generate TFIDF because I need to cluster it based on similarity. As I'm new I'd really appreciate your advice what could be working with my data.

Best, Nina

nina_ploetzl · November 2018

Thanks @MarcoBarradas so much!

But when I run the process I have no outcome...

And when I put a breakpoint at the Select Attributes operator (after the stemming, tokenizing and stop word filtering), the results view of the examples still shows the whole text, not the text without stop words, etc. Does this mean it hasn't tokenized/filtered them?

Attached I send a screenshot of the result it gives me when I run the process. It's empty...

Thank you so much for your time and help!

Best, Nina

Image: https://us.v-cdn.net/6030995/uploads/editor/ui/l4c4cp16pob1.png

MarcoBarradas · November 2018

Nina, your pic is from the Data to Documents output.
The TF-IDF vector is on the exa port output of the Process Documents from Files operator. Since you are trying to cluster documents, I guess you can connect a clustering operator to that output and it may show you your first clusters.
I saw you tokenized on non-letters; I don't know if it would be a better idea to tokenize on linguistic sentences. @mschmitz, @Thomas_Ott, @IngoRM what would be your advice?

nina_ploetzl · November 2018

@MarcoBarradas does the tokenizing etc. work? Because when I look at the results after the Select Attributes operator, the text still contains stop words and the words are not stemmed (see the column "text" in the screenshot).

My fault, sorry, but if I run the whole process it says "Attribute already present. The attribute text was already present in the example set."

Is this a problem with the Extract Topics from Document operator?

Image: https://us.v-cdn.net/6030995/uploads/editor/hu/cpfpy48fbpo9.png

nina_ploetzl · November 2018

And @MarcoBarradas, I have one question about the process you built:

When I run the process until "Select Attributes", it counts 0 for the words "audit" and "auditor" in every example of my data set.

But when I run the process until "WordList to Data", it counts 1000-4000 occurrences of these words overall. How is that possible?

Thank you so much for helping me!

MartinLiebig · November 2018

Hi @nina_ploetzl,
w.r.t. the "text" bug: somewhat, yes. It's simply not allowed to have the same attribute twice. Not sure how to overcome this though.
BR,
Martin

MarcoBarradas · November 2018

Hi Nina, the reason you see a count of 0 at Select Attributes is that the example set holds Term Frequency-Inverse Document Frequency (TF-IDF) values rather than raw counts. When TF-IDF is created, the term frequency is normalized to adjust for the length of each text you are analyzing. If this weren't done, a term might gain importance only because it was extracted from a longer document.
You can find more about the subject at this link: https://www.commonlounge.com/discussion/99e86c9c15bb4d23a30b111b23e7b7b1
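
As a rough illustration of the normalization described above, here is a minimal Python sketch of one common textbook TF-IDF formulation (term count divided by document length, times log(N / document frequency)). The exact formula RapidMiner uses may differ, which is an assumption here, and the two tiny documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute length-normalized TF-IDF weights for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency is divided by the document length,
            # then scaled by the inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized documents, for illustration only.
docs = [
    ["audit", "audit", "risk"],
    ["audit", "fee", "fee", "fee"],
]
w = tf_idf(docs)
print(w)
```

Under this formulation a term that occurs in every document, like "audit" in the sample, gets log(N / N) = 0 and therefore a weight of 0 in every row, even though its total count in a word list can be large.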

A workaround for the "attribute text already present" error is to rename the attribute.
I made some tweaks to the previous version:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="85">
        <list key="text_directories">
          <parameter key="Files" value="C:\Users\mbarradas\Downloads\Files"/>
        </list>
        <parameter key="file_pattern" value="*.txt"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34">
            <parameter key="characters" value="&quot; &quot;"/>
          </operator>
          <operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="514" y="34"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
          <connect from_op="Stem (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Opens Files from the Directory you set at text directories</description>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="8.1.000" expanded="true" height="82" name="WordList to Data" width="90" x="246" y="289"/>
      <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename" width="90" x="246" y="187">
        <parameter key="old_name" value="text"/>
        <parameter key="new_name" value="words"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="136">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="words" value="1.0"/>
        </list>
      </operator>
      <operator activated="true" class="operator_toolbox:lda" compatibility="1.5.000" expanded="true" height="124" name="Extract Topics from Document (LDA)" width="90" x="581" y="238">
        <parameter key="number_of_topics" value="2"/>
      </operator>
      <operator activated="true" class="free_memory" compatibility="8.2.000" expanded="true" height="103" name="Free Memory" width="90" x="849" y="187"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_port="result 1"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="original" to_port="result 2"/>
      <connect from_op="Rename" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Extract Topics from Document (LDA)" to_port="col"/>
      <connect from_op="Extract Topics from Document (LDA)" from_port="exa" to_op="Free Memory" to_port="through 1"/>
      <connect from_op="Extract Topics from Document (LDA)" from_port="top" to_op="Free Memory" to_port="through 2"/>
      <connect from_op="Free Memory" from_port="through 1" to_port="result 3"/>
      <connect from_op="Free Memory" from_port="through 2" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

I don't know if all the documents are in English, but you should check the word filtering so that terms like aa, aaa, aaahq won't appear. Also check the encoding of the files. I don't know if this means anything or if it is just an encoding error: accuracyâ, acquã

nina_ploetzl · November 2018

@MarcoBarradas thank you for your help!!

I'm going with the LDA suggestion from @mschmitz now. For all who are interested, I'm posting the XML code below:

[Process XML (version 9.0.003) not preserved in this copy of the thread.]
