
"python's subprocess.run() not working inside Rapidminer"

lplenka Member Posts: 11 Contributor II
edited June 2019 in Help

Hello friends, I am having trouble with Python's subprocess.run() inside the Execute Python operator. I am using Xpdf's pdftotext to extract text from a PDF file. The subprocess seems to fail when I run the process, as I always get a blank text file.

 

System details:

Windows 10

RapidMiner Studio 8.0

Python 3.6

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="380" y="187">
<parameter key="script" value="import pandas&#10;import sys&#10;import subprocess&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;&#10;    def pdf_text(source, output, timeout=None):&#10;&#10;        if sys.platform == &quot;win32&quot;:&#10;            args = ['pdftotext', '-simple', source, output]&#10;        elif sys.platform == &quot;linux&quot; or sys.platform == &quot;linux2&quot;:&#10;            args = ['pdftotext', '-layout', source, output]&#10;&#10;        with open(output,&quot;w+&quot;):&#10;            process = subprocess.run(&#10;                args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout, shell = True)&#10;&#10;&#10;    input_file = &quot;D:/pdf-sample.pdf&quot;&#10;    output_file = &quot;D:/ouput.txt&quot;&#10;    pdf_text(input_file, output_file)&#10;&#10;    return"/>
</operator>
</process>

I am unable to find any reason for the wrong output. Please help!


Best Answer

  • lplenka Member Posts: 11 Contributor II
    Solution Accepted

    Hey @lionelderkrikor,

     

    Thanks for trying to help.

    Sorry, the previous XML file had an error. Here is the new XML file.

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="380" y="187">
    <parameter key="script" value="import pandas&#10;import sys&#10;import subprocess&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;&#10;    def pdf_text(source, output, timeout=None):&#10;&#10;        if sys.platform == &quot;win32&quot;:&#10;            args = ['pdftotext', '-simple', source, output]&#10;        elif sys.platform == &quot;linux&quot; or sys.platform == &quot;linux2&quot;:&#10;            args = ['pdftotext', '-layout', source, output]&#10;&#10;        with open(output,&quot;w+&quot;):&#10;            process = subprocess.run(&#10;                args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout, shell = True)&#10;&#10;&#10;    input_file = &quot;D:/pdf-sample.pdf&quot;&#10;    output_file = &quot;D:/ouput.txt&quot;&#10;    pdf_text(input_file, output_file)&#10;&#10;    return"/>
    </operator>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    </process>

    Well, yes, the Python script works fine when I run it in a notebook or call it from cmd.

    rm_main() takes no arguments because this script doesn't need any, and I want the text to be extracted to "output.txt" on my D: drive. That is also why there is no return value.

     

     

    Note:

    Surprisingly, I am now getting the extracted text in the "output.txt" file. I don't know why I was not getting any output last night. Did the restart do the trick? Please cross-check on your system. Thank you :)
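
One possible explanation for the restart effect (an assumption, not something confirmed in this thread): a program reads environment variables such as PATH when it starts, so if pdftotext was added to PATH while RapidMiner Studio was already running, the Python process it launches would not find the tool until Studio (or the system) is restarted. A minimal sketch for checking this from inside Execute Python follows. The helper name `describe_environment` is hypothetical, and where the printed lines surface in Studio may vary.

```python
import os
import shutil
import sys


def describe_environment(tool="pdftotext"):
    """Report how the current Python process would resolve an external tool."""
    return {
        "interpreter": sys.executable,            # which Python is actually running
        "working_dir": os.getcwd(),
        "tool_path": shutil.which(tool),          # None if the tool is not on PATH
        "path_entries": os.environ.get("PATH", "").split(os.pathsep),
    }


def rm_main():
    # No input ports, so no arguments; nothing is returned to RapidMiner.
    for key, value in describe_environment().items():
        print(key, "=", value)
```

If `tool_path` is None here but the same check finds pdftotext in a notebook or in cmd, the two environments differ, which would be consistent with the script working everywhere except inside the operator.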

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @lplenka,

     

    First, it seems there is an error in the XML code you shared: it can't be loaded in RapidMiner. Maybe the code is incomplete. To copy the whole process,

    click in the XML panel, press Ctrl + A, then Ctrl + C, and then paste it.
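
A pasted process can also be checked outside RapidMiner before sharing it. The sketch below parses the XML with Python's standard `xml.etree.ElementTree` and reports the class of the first operator under `<process>` together with the decoded script. The helper name `inspect_process_xml` is hypothetical, and the expectation that the first operator has `class="process"` is an assumption drawn from the complete processes posted in this thread, not a documented RapidMiner rule.

```python
import xml.etree.ElementTree as ET


def inspect_process_xml(xml_text):
    """Parse pasted process XML and summarize it.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    first_operator = root.find("operator")  # direct child of <process>
    script = None
    for param in root.iter("parameter"):
        if param.get("key") == "script":
            script = param.get("value")  # &#10; and &quot; arrive already decoded
            break
    return {
        "root_operator_class": (
            None if first_operator is None else first_operator.get("class")
        ),
        "script": script,
    }
```

Printing the returned `script` also shows the Python code with real line breaks, which makes indentation problems easier to spot than in the escaped attribute.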

     

    1. For the Python code to be executed, you have to use the function rm_main. In your case, rm_main takes no input arguments (def rm_main()) and you define another function inside it instead: def pdf_text().

    2. I also see that the function rm_main() doesn't return any output: the return statement is empty.

    3. Have you tried running your code in a notebook?

     

    Regards, 

     

    Lionel
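
A note on why the posted script can produce a blank file without any error: `open(output, "w+")` creates (or truncates) the output file before pdftotext runs, so an empty file appears even if the command never starts, and because stderr is piped but never read and the return code is never checked, a failure stays silent. In addition, on Linux `shell=True` combined with an argument list passes only the first element as the command. The following is a minimal sketch of a variant that fails loudly instead. It is not the poster's final script; the `tool` parameter is a hypothetical addition, and every non-Windows platform is assumed to use `-layout`.

```python
import subprocess
import sys


def pdf_text(source, output, timeout=None, tool="pdftotext"):
    """Run pdftotext and raise if it fails, instead of failing silently."""
    # Flags taken from the original post; non-Windows platforms get -layout.
    layout_flag = "-simple" if sys.platform == "win32" else "-layout"
    args = [tool, layout_flag, source, output]
    # No shell=True and no open(output, "w+"): pdftotext creates the file
    # itself, so an empty file can no longer hide a failed run. If the tool
    # cannot be found, subprocess.run raises FileNotFoundError.
    result = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(
            "{} exited with code {}: {}".format(
                tool, result.returncode, result.stderr.decode(errors="replace")
            )
        )
    return output


def rm_main():
    # Paths as given in the original post.
    pdf_text("D:/pdf-sample.pdf", "D:/ouput.txt")
```

With this shape, a missing executable or a pdftotext error should stop the operator with a message rather than leaving an empty file behind.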

     

     

     

     

     

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi again @lplenka,

     

    Just a note: if you want to extract text from a .pdf file, you can use the "Text Processing" extension of RapidMiner.

    Maybe you can use the operators of this extension to do what you want.

    Here is a useful link:

    https://community.rapidminer.com/t5/Getting-Started-Knowledge-Base/Keyword-Frequency-in-Text-Mining/ta-p/31618

     

    Regards, 

     

    Lionel

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @lplenka,

     

    In my case, the output.txt file is empty after running the Execute Python operator with your process.

    However, to complete my last post: you can perform this text extraction with the Read Document and Write Document operators of the Text Processing extension.

     

    Here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="Execute Python" width="90" x="380" y="187">
    <parameter key="script" value="import pandas&#10;import sys&#10;import subprocess&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;&#10;    def pdf_text(source, output, timeout=None):&#10;&#10;        if sys.platform == &quot;win32&quot;:&#10;            args = ['pdftotext', '-simple', source, output]&#10;        elif sys.platform == &quot;linux&quot; or sys.platform == &quot;linux2&quot;:&#10;            args = ['pdftotext', '-layout', source, output]&#10;&#10;        with open(output,&quot;w+&quot;):&#10;            process = subprocess.run(&#10;                args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout, shell = True)&#10;&#10;&#10;    input_file = &quot;C:/Users/Lionel/Documents/Formations_DataScience/Rapidminer/Tests_Rapidminer/Extract_text_python/pdf-sample.pdf&quot;&#10;    output_file = &quot;C:/Users/Lionel/Documents/Formations_DataScience/Rapidminer/Tests_Rapidminer/Extract_text_python/ouput.txt&quot;&#10;    pdf_text(input_file, output_file)&#10;&#10;    return"/>
    </operator>
    <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="380" y="340">
    <parameter key="file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Extract_text_python\pdf-sample.pdf"/>
    </operator>
    <operator activated="true" class="text:write_document" compatibility="7.5.000" expanded="true" height="82" name="Write Document" width="90" x="514" y="340">
    <parameter key="file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Extract_text_python\output.txt"/>
    </operator>
    <connect from_op="Read Document" from_port="output" to_op="Write Document" to_port="document"/>
    <connect from_op="Write Document" from_port="document" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Best regards, 

     

    Lionel

  • lplenka Member Posts: 11 Contributor II

    Thanks @lionelderkrikor for the help.

    I will use the text mining operators next time.

     

    By the way, you could restart your system, and my process will probably start producing the correct result. This is just a hypothesis, but it worked in my case.

     

    Thanks for all the help :)
