Read PDF Tables Extension - Need to

miked Member Posts: 21 Contributor II
edited February 2020 in Help
Hello - I am trying to use the "Read PDF Tables" extension. I have successfully read my PDF, but it has been split into 21 different example sets. I would like to use the "Select" operator to choose the example sets that I need, but I am running into two issues. First, "Select" only lets you pick one example set, whereas I need to select 5. Second, the example sets are not all the same: only 5 of the 21 have the attribute headings that I actually need. Would anyone have any ideas on how I can pull what I need from this collection? I have been trying to use loops, but without success. Thanks!

Best Answers

  • miked Member Posts: 21 Contributor II
    Solution Accepted
    Hi @sgenzer, great, thank you. That definitely helps narrow down which example sets have the attributes that I need. Would I then just follow @varunm1's method and connect n "Select" operators to "Append" to combine the sets? Is there a way to use a macro to count the example sets and loop "Select" n times? If not, this should work for now, and I thank you both for your help.
    -Mike

Answers

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited February 2020
    Hello @miked

    Did you try using the "Multiply" operator after the collection and then connecting five "Select" operators, each picking one example set by its index in the collection? If all five have the same attribute names, you can also use the "Append" operator to combine them into a single example set.
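
    A minimal sketch of the wiring described above, shown with two "Select" operators instead of five to keep it short. The indices 3 and 7 are hypothetical placeholders, not values from the original PDF, and the operator and port names are the ones assumed for RapidMiner 9.6. Parameters left out fall back to their defaults.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <process expanded="true">
          <!-- Same reader as elsewhere in this thread; set the PDF file in its parameters -->
          <operator activated="true" class="pdf_table_extraction:pdfs2exampleset_operator" compatibility="0.2.001" expanded="true" name="Read PDF Tables"/>
          <!-- Copies the whole collection once per Select operator -->
          <operator activated="true" class="multiply" compatibility="9.6.000" expanded="true" name="Multiply"/>
          <!-- Hypothetical index: replace 3 with the position of a table you need -->
          <operator activated="true" class="select" compatibility="9.6.000" expanded="true" name="Select (3)">
            <parameter key="index" value="3"/>
          </operator>
          <!-- Hypothetical index: replace 7 with the position of another table you need -->
          <operator activated="true" class="select" compatibility="9.6.000" expanded="true" name="Select (7)">
            <parameter key="index" value="7"/>
          </operator>
          <!-- Only works if the selected example sets share the same attribute names -->
          <operator activated="true" class="append" compatibility="9.6.000" expanded="true" name="Append"/>
          <connect from_op="Read PDF Tables" from_port="collection of pdf data tables as example sets" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Select (3)" to_port="collection"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Select (7)" to_port="collection"/>
          <connect from_op="Select (3)" from_port="selected" to_op="Append" to_port="example set 1"/>
          <connect from_op="Select (7)" from_port="selected" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
        </process>
      </operator>
    </process>
    ```

    Each extra table needs one more "Select" operator and one more "Append" input, so this layout only suits cases where the positions of the wanted tables are known and stable.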

    There may be some other solutions as well. @David_A or @mschmitz any ideas here?
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • miked Member Posts: 21 Contributor II
    @varunm1
    Thanks for the suggestion. That would definitely work for now. I think what I'm looking for is a bit more automation. My fear is that it won't always be the same 5 example sets. I was hoping for some way to identify which of those example sets has the attributes that I am looking for and pull those sets regardless of how many there are. 
    -Mike
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hi Mike,

    Yep, understood. Let's see if anyone responds.
    Regards,
    Varun
    https://www.varunmandalapu.com/


  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi @miked, I have worked with this situation before. I usually use "Loop Collection" afterwards and then check each ExampleSet to see if it has the attributes I'm looking for. Something like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="pdf_table_extraction:pdfs2exampleset_operator" compatibility="0.2.001" expanded="true" height="68" name="Read PDF Tables" width="90" x="112" y="136">
            <parameter key="resource_type" value="file"/>
            <parameter key="attribute" value=""/>
            <parameter key="tune extraction criteria" value="false"/>
            <parameter key="discard tables with no rows" value="false"/>
            <parameter key="discard empty attributes" value="false"/>
            <parameter key="heuristic ratio for table content" value="0.65"/>
            <parameter key="tune edge detection criteria" value="false"/>
            <parameter key="grayscale intensity threshold" value="25"/>
            <parameter key="minimum width of horizontal edge" value="50"/>
            <parameter key="minimum height of vertical edge" value="10"/>
            <parameter key="maximum cell corner distance" value="10"/>
            <parameter key="required text lines for edge" value="4"/>
            <parameter key="required cells for table" value="4"/>
            <parameter key="point snap distance threshold" value="8.0"/>
            <parameter key="table padding amount" value="1.0"/>
            <parameter key="identical table overlap ratio" value="0.9"/>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="246" y="136">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <description align="center" color="transparent" colored="false" width="126">enter the attribute of example sets you want to keep</description>
              </operator>
              <operator activated="true" class="branch" compatibility="9.6.000" expanded="true" height="82" name="Branch" width="90" x="179" y="34">
                <parameter key="condition_type" value="min_attributes"/>
                <parameter key="condition_value" value="1"/>
                <parameter key="expression" value=""/>
                <parameter key="io_object" value="ANOVAMatrix"/>
                <parameter key="return_inner_output" value="true"/>
                <process expanded="true">
                  <connect from_port="condition" to_port="input 1"/>
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                  <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="132" y="13">keep the ExampleSet</description>
                </process>
                <process expanded="true">
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                  <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="162" y="13">do not keep the ExampleSet</description>
                </process>
                <description align="center" color="transparent" colored="false" width="126">branch to some minimum # of attributes (1?)</description>
              </operator>
              <connect from_port="single" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Branch" to_port="condition"/>
              <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read PDF Tables" from_port="collection of pdf data tables as example sets" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
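
    A possible follow-up to the process above, not part of the original post: if the ExampleSets that survive the loop all share the same attribute names, the collection can probably be combined in one step instead of wiring n "Select" operators. The fragment below is a sketch under the assumption that "Append" accepts a collection on its first input port; check this in your RapidMiner version.

    ```xml
    <!-- Fragment only: place inside the top-level <process expanded="true"> of the process above,
         replacing its connect from Loop Collection "output 1" to "result 1". -->
    <!-- Assumption: Append unfolds a collection arriving at "example set 1" -->
    <operator activated="true" class="append" compatibility="9.6.000" expanded="true" name="Append"/>
    <connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    ```

    With this layout there is no need to count the example sets with a macro, because the loop handles however many tables the PDF yields.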
    



  • miked Member Posts: 21 Contributor II
    @sgenzer
    That's fantastic thank you all!
    Two supplemental questions but not vital to solving the issue. 
    1 - Three attributes did not come through the Loop -> Select Attributes step, so I decided to just go with "all". Two of the column headers are labeled in the PDF as "CurrentMonth's Sale" and "CYTD 2019", so I assume there are some limits to what Read PDF Tables can do, as @ey stated above?
    2 - If the example sets were not all the same, can I manipulate them inside the collection, or is it better to use "Branch" and pull them out?
    I'm a bit of a newbie especially with "Collections." I really appreciate the help of the group here. 
    -Mike
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @miked glad everything is working for you! It's hard to answer your new questions here without really seeing some examples. There are some limitations to the Read PDF Tables operator - mostly because PDF tables come in a ton of different shapes and sizes.