"[SOLVED] Processs documents from files"

ojuarez Member Posts: 5 Contributor II
edited June 2019 in Help
Hello Everyone.

I am trying to process several documents with the "Process Documents from Files" operator. In the first case all files were in the same directory and everything worked perfectly. In the second case the files are inside sub-folders, and I didn't get any results.

After investigating, I am now trying the "Loop Files" operator. In the sub-process of the loop operator I have 2 more operators:

1. Provide Macro as Log Value
2. Process Documents from Files

I don't get any errors, but I don't get any output either. If I place a breakpoint after "Process Documents from Files", I can see that it processes the first directory correctly, but I still can't get the output.

Here is an example:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.012">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="loop_files" compatibility="5.3.012" expanded="true" height="94" name="Loop Files" width="90" x="246" y="120">
       <parameter key="directory" value="C:\Users\ojuarez\httrack\Curacao\www.lacuracaonline.com\guatemala\productos\audio-y-video\televisores"/>
       <parameter key="recursive" value="true"/>
       <parameter key="iterate_over_subdirs" value="true"/>
       <process expanded="true">
         <operator activated="true" class="provide_macro_as_log_value" compatibility="5.3.012" expanded="true" height="94" name="Provide Macro as Log Value" width="90" x="112" y="120">
           <parameter key="macro_name" value="file_name"/>
         </operator>
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="179" y="345">
           <list key="text_directories">
             <parameter key="archivos" value="%{file_path}"/>
           </list>
           <parameter key="extract_text_only" value="false"/>
           <process expanded="true">
             <connect from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="log" compatibility="5.3.012" expanded="true" height="76" name="Log" width="90" x="514" y="120">
           <parameter key="filename" value="C:\Users\ojuarez\httrack\log_1"/>
           <list key="log">
             <parameter key="filename" value="operator.Provide Macro as Log Value.value.macro_value"/>
           </list>
         </operator>
         <connect from_op="Provide Macro as Log Value" from_port="through 1" to_op="Log" to_port="through 1"/>
         <connect from_op="Provide Macro as Log Value" from_port="through 2" to_op="Process Documents from Files" to_port="word list"/>
         <connect from_op="Process Documents from Files" from_port="example set" to_port="out 2"/>
         <connect from_op="Log" from_port="through 1" to_port="out 1"/>
         <portSpacing port="source_file object" spacing="0"/>
         <portSpacing port="source_in 1" spacing="0"/>
         <portSpacing port="sink_out 1" spacing="0"/>
         <portSpacing port="sink_out 2" spacing="252"/>
         <portSpacing port="sink_out 3" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Loop Files" from_port="out 1" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>






Answers

  • fras Member Posts: 93 Contributor II
    Hi,

    Please check the attached process. I changed the
    operator to "Process Documents from Data". For text
    analysis you should activate some tokenizing inside it.
    Happy mining,
    Frank

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros>
          <macro>
            <key>home</key>
            <value>C:\Users\fras</value>
          </macro>
        </macros>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <description>&lt;b&gt;
    Change macro "home" to your needs to store test data.

    &lt;/b&gt;</description>
        <process expanded="true">
          <operator activated="true" class="loop" compatibility="5.3.013" expanded="true" height="76" name="gen TestData" width="90" x="45" y="30">
            <parameter key="set_iteration_macro" value="true"/>
            <parameter key="iterations" value="10"/>
            <process expanded="true">
              <operator activated="true" class="generate_macro" compatibility="5.3.013" expanded="true" height="60" name="Generate Macro" width="90" x="45" y="30">
                <list key="function_descriptions">
                  <parameter key="numEx" value="ceil(10000 * rand())"/>
                </list>
              </operator>
              <operator activated="true" class="generate_sales_data" compatibility="5.3.013" expanded="true" height="60" name="Generate Sales Data" width="90" x="179" y="30">
                <parameter key="number_examples" value="%{numEx}"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="5.3.013" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="label"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="write_csv" compatibility="5.3.013" expanded="true" height="76" name="Write CSV" width="90" x="447" y="30">
                <parameter key="csv_file" value="%{home}\Desktop\demo\files\data-%{iteration}.csv"/>
              </operator>
              <connect from_op="Generate Sales Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Write CSV" to_port="input"/>
              <connect from_op="Write CSV" from_port="through" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_files" compatibility="5.3.013" expanded="true" height="76" name="Loop Files" width="90" x="45" y="165">
            <parameter key="directory" value="%{home}\Desktop\demo\files"/>
            <parameter key="iterate_over_subdirs" value="true"/>
            <process expanded="true">
              <operator activated="false" class="provide_macro_as_log_value" compatibility="5.3.013" expanded="true" height="60" name="Provide Macro as Log Value" width="90" x="45" y="30">
                <parameter key="macro_name" value="file_name"/>
              </operator>
              <operator activated="false" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="180" y="30">
                <list key="text_directories">
                  <parameter key="archivos" value="%{file_path}"/>
                </list>
                <parameter key="extract_text_only" value="false"/>
                <process expanded="true">
                  <connect from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="print_to_console" compatibility="5.3.013" expanded="true" height="60" name="Print to Console" width="90" x="315" y="30">
                <parameter key="log_value" value="File----&gt; %{file_name}, %{file_path}"/>
              </operator>
              <operator activated="true" class="text:read_document" compatibility="5.3.000" expanded="true" height="60" name="Read Document" width="90" x="447" y="30">
                <parameter key="file" value="%{file_path}"/>
                <parameter key="use_file_extension_as_type" value="false"/>
              </operator>
              <operator activated="true" class="text:documents_to_data" compatibility="5.3.000" expanded="true" height="76" name="Documents to Data" width="90" x="581" y="30">
                <parameter key="text_attribute" value="text_att"/>
                <parameter key="label_attribute" value="label_att"/>
              </operator>
              <connect from_op="Read Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Documents to Data" from_port="example set" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="5.3.013" expanded="true" height="76" name="Append" width="90" x="179" y="165"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="165">
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="1500"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="false" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="179" y="120"/>
              <operator activated="false" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="379" y="120"/>
              <connect from_port="document" to_port="document 1"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_macro" compatibility="5.3.013" expanded="true" height="60" name="set HOME" width="90" x="45" y="300">
            <parameter key="macro" value="home"/>
            <parameter key="value" value="C:\Users\fras"/>
          </operator>
          <connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
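
    As posted, the process above leaves Tokenize and Transform Cases deactivated inside "Process Documents from Data", and the inner connection passes each document straight through. A minimal sketch of that inner process with tokenizing switched on might look like the fragment below. The operator classes and port names are copied from the process above; the wiring between them is an assumption and has not been tested here.

    ```xml
    <!-- Sketch only: inner process of "Process Documents from Data" with tokenizing active.
         Operators and port names come from the process above; the connections are assumed. -->
    <process expanded="true">
      <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="179" y="120"/>
      <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="379" y="120"/>
      <!-- Route each document through both operators instead of straight to the output port. -->
      <connect from_port="document" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
      <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
      <portSpacing port="source_document" spacing="0"/>
      <portSpacing port="sink_document 1" spacing="0"/>
      <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    ```

    With this wiring every document passes through Tokenize and Transform Cases before the word vector is built, which is what the pruning parameters on the outer operator act on.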


  • ojuarez Member Posts: 5 Contributor II
    Thanks for your reply. I am away from the office today, but trust me, it will be the first thing I try as soon as I get back.
  • ojuarez Member Posts: 5 Contributor II
    I tested your process and it worked. What I was trying to achieve was somewhat different, but with your example I figured out where my problem was.

    I was missing the Append operator at the top level of the process, after the Loop Files operator. Once I added it, everything worked as expected.

    I am really grateful for your help, it was driving me crazy!
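
    For readers with the same problem, the missing piece can be sketched as a top-level fragment. It reuses the Append operator and the port names from Frank's process. The "out 1" port is an assumption: in the process from the original question the example set leaves Loop Files through "out 2", so that port would have to be used there instead.

    ```xml
    <!-- Sketch only: top-level wiring after Loop Files, based on Frank's process. -->
    <!-- Append merges the example sets that Loop Files returns, one per iteration, into a single set. -->
    <operator activated="true" class="append" compatibility="5.3.013" expanded="true" height="76" name="Append" width="90" x="179" y="165"/>
    <!-- Assumed port: use whichever "out" port of Loop Files carries the example set. -->
    <connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    ```

    Without Append, the loop hands back its per-file results separately, which matches the symptom in the question: the breakpoint shows the first directory being processed, yet no combined output appears at the end.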