🎉 🎉 RAPIDMINER 9.10 IS OUT!!! 🎉🎉

Download the latest version helping analytics teams accelerate time-to-value for streaming and IIOT use cases.

CLICK HERE TO DOWNLOAD

The contents readed by read document operator from PDF file are disorder

LeiLei Member Posts: 2 Newbie
I use read document operator to read a PDF file. The content order is different from the text order of the PDF file itself.

<?xml version="1.0" encoding="UTF-8"?><process version="9.10.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="112" y="85">
        <parameter key="file" value="D:/water electrolysis/PDF electrolyte/0019178499_#14. 1-s2.0-S0925838817304759-main1.pdf"/>
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Extracted by read document operator: 


Original PDF file:



 The order of highlight words in above two images is totally different. 
Could someone help me to fix this problem? Thank you very much.
Sign In or Register to comment.