The Write as Text operator doesn't work

noorhafhizah Member Posts: 2 Contributor I
edited November 2018 in Help
Good day, all.

I am trying to preprocess text from a CSV file and write the result to a .txt file. However, the processed output does not get written to the .txt file. When something is written, it is only a copy of the input file.

On the positive side, I do get the processed output on stdout in the logging results.

My process XML is attached below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
    <parameter key="logverbosity" value="status"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true" height="370" width="820">
      <operator activated="true" class="text:read_document" compatibility="5.2.002" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
        <parameter key="file" value="C:\Users\user\Desktop\product2e.csv"/>
        <parameter key="extract_text_only" value="true"/>
        <parameter key="use_file_extension_as_type" value="true"/>
        <parameter key="content_type" value="txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.2.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120">
        <parameter key="mode" value="non letters"/>
        <parameter key="characters" value=".:"/>
        <parameter key="language" value="English"/>
        <parameter key="max_token_length" value="3"/>
      </operator>
      <operator activated="true" class="text:transform_cases" compatibility="5.2.002" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="120">
        <parameter key="transform_to" value="lower case"/>
      </operator>
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="120"/>
      <operator activated="true" class="write_as_text" compatibility="5.2.006" expanded="true" height="76" name="Write as Text" width="90" x="581" y="120">
        <parameter key="result_file" value="C:\Users\user\Desktop\result1.txt"/>
        <parameter key="encoding" value="SYSTEM"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
      <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Write as Text" to_port="input 1"/>
      <connect from_op="Write as Text" from_port="input 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="234"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
One more thing: how can I write the output line by line? On stdout, all the words are written together without any separation.

Thanks all.

Fizah.

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    What kind of output do you expect? After all, an example set is a tabular structure, and its most natural text representation is something like the CSV format.

    Best, Marius
  • noorhafhizah Member Posts: 2 Contributor I
    Thanks for the reply.

    Example input:
    A001 Couldn't be more pleased with the purchase great camera great price.
    A002 This is a great camera for its price and worth it!
    A003   An excellent camera. Should be able to be picked up for less than $300 soon. .
    A004  Get a good one do not waste your money.
    and the expected output:
    more pleased purchase great camera great price
    great camera price worth
    excellent camera picked up for less soon
    get  good one waste your money

    Is it possible to produce output like the example above and write it to a .txt file?


    Cheers,
    Fizah
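For reference, the transformation asked about above (drop the record ID, tokenize on non-letters, lowercase, filter stopwords, write one line per record) can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions, not how the Write as Text operator works: the names `STOPWORDS`, `preprocess_line`, and `preprocess_file` are hypothetical, and the stopword list is a tiny stand-in for RapidMiner's built-in English list, so it matches only the first two example lines.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this
# sketch); RapidMiner's English stopword filter uses a much longer one.
STOPWORDS = {
    "a", "an", "and", "be", "couldn", "do", "for", "is",
    "it", "its", "not", "t", "the", "this", "to", "with",
}


def preprocess_line(line: str) -> str:
    """Return the cleaned tokens of one record as a space-separated string."""
    # Assumption: the first whitespace-separated field is the record ID.
    parts = line.split(maxsplit=1)
    text = parts[1] if len(parts) > 1 else ""
    # Lowercase, then split on anything that is not an ASCII letter.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def preprocess_file(src: str, dst: str, encoding: str = "utf-8") -> int:
    """Write one cleaned line per non-empty input line; return the line count."""
    written = 0
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            if line.strip():
                fout.write(preprocess_line(line) + "\n")
                written += 1
    return written
```

Calling `preprocess_file("product2e.csv", "result1.txt")` would then write each record on its own line, which is the line-by-line layout asked for in the first post.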
  • ayaRizk Member Posts: 6 Contributor II
    edited February 2023
    This seems to be a very old problem, but I still encounter the same issue. I have annual reports in PDF format, and after many text-processing steps I get the desired output in the split screen of the Results tab, but "Write as Text" writes the original text rather than the processed one. Any clue? Perhaps @MartinLiebig can help?

    I am attaching both my process and a snapshot of the Results tab.

    Thanks,
    Aya
    <?xml version="1.0" encoding="UTF-8"?>
    <process version="10.0.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="10.0.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="10.0.000" expanded="true" height="82" name="Loop Files" width="90" x="45" y="34">
            <parameter key="directory" value="/Users/ayari88/Documents/Research/AFA/ROBOT/Kommuners AR"/>
            <parameter key="filter_type" value="glob"/>
            <parameter key="filter_by_regex" value=".*\.docx$"/>
            <parameter key="recursive" value="true"/>
            <parameter key="skip_inaccessible" value="true"/>
            <parameter key="enable_macros" value="false"/>
            <parameter key="macro_for_file_name" value="file_name"/>
            <parameter key="macro_for_file_type" value="file_type"/>
            <parameter key="macro_for_folder_name" value="folder_name"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="handle_exception" compatibility="10.0.000" expanded="true" height="82" name="Handle Exception" width="90" x="179" y="34">
                <parameter key="add_details_to_log" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="text:read_document" compatibility="10.0.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="34">
                    <parameter key="extract_text_only" value="true"/>
                    <parameter key="use_file_extension_as_type" value="true"/>
                    <parameter key="content_type" value="pdf"/>
                    <parameter key="encoding" value="SYSTEM"/>
                  </operator>
                  <connect from_port="in 1" to_op="Read Document" to_port="file"/>
                  <connect from_op="Read Document" from_port="output" to_port="out 1"/>
                  <portSpacing port="source_in 1" spacing="0"/>
                  <portSpacing port="source_in 2" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                  <portSpacing port="sink_out 2" spacing="0"/>
                </process>
                <process expanded="true">
                  <portSpacing port="source_in 1" spacing="0"/>
                  <portSpacing port="source_in 2" spacing="0"/>
                  <portSpacing port="sink_out 1" spacing="0"/>
                  <portSpacing port="sink_out 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="file object" to_op="Handle Exception" to_port="in 1"/>
              <connect from_op="Handle Exception" from_port="out 1" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="10.0.000" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="10.0.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="313" y="34">
                <parameter key="min_chars" value="3"/>
                <parameter key="max_chars" value="30"/>
              </operator>
              <operator activated="true" class="open_file" compatibility="10.0.000" expanded="true" height="68" name="Open File" width="90" x="313" y="289">
                <parameter key="resource_type" value="file"/>
                <parameter key="filename" value="/Users/ayari88/Documents/Research/AFA/ROBOT/RapidMiner/Custom_stopwords_ar.csv"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="10.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="447" y="187">
                <parameter key="case_sensitive" value="false"/>
                <parameter key="encoding" value="UTF-8"/>
              </operator>
              <operator activated="true" class="text:stem_snowball" compatibility="10.0.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="581" y="34">
                <parameter key="language" value="Swedish"/>
              </operator>
              <connect from_port="single" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Open File" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
              <connect from_op="Stem (Snowball)" from_port="document" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="10.0.000" expanded="true" height="82" name="Write files" width="90" x="313" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="false" class="text:write_document" compatibility="10.0.000" expanded="true" height="82" name="Write Document" width="90" x="112" y="238">
                <parameter key="file" value="/Users/ayari88/Documents/Research/AFA/ROBOT/Kommuners AR preprocessed/%{a}.txt"/>
                <parameter key="overwrite" value="true"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <operator activated="true" class="write_as_text" compatibility="10.0.000" expanded="true" height="82" name="Write as Text" width="90" x="380" y="34">
                <parameter key="result_file" value="/Users/ayari88/Documents/Research/AFA/ROBOT/Kommuners AR preprocessed/%{a}.txt"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <connect from_port="single" to_op="Write as Text" to_port="input 1"/>
              <connect from_op="Write as Text" from_port="input 1" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="false" class="r_scripting:execute_r" compatibility="9.6.000" expanded="true" height="82" name="Execute R" width="90" x="380" y="289">
            <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;install.packages(&quot;seededlda&quot;)&#10;install.packages(&quot;quanteda&quot;)&#10;&#10;library(readtext)&#10;library(quanteda)&#10;library(seededlda)&#10;&#10;&#10;rm_main = function(data)&#10;{&#10;&#10;# create a dictionary of seed topics&#10;seeds_dict &lt;- dictionary(list(robotisering = c(&quot;robotisering&quot;, &quot;robot&quot;, &quot;rpa&quot;),&#10;                        automatisering = c(&quot;automatisering&quot;, &quot;automation&quot;, &quot;automat*&quot;),&#10;                        artificiell_intelligens = c(&quot;artificiell intelligens&quot;, &quot;ai&quot;, &quot;maskininlärning&quot;)))&#10;#create dfm from data&#10;key_dfm &lt;- dfm(data)&#10;&#10;# run seededLDA on the matrix&#10;seeded &lt;- textmodel_seededlda(&#10;  key_dfm, &#10;  k = 4,&#10;  seeds_dict, &#10;  case_insensitive = TRUE,&#10;  max_iter = 1000)&#10;&#10; # connect 2 output ports to see the results&#10;return(terms(seeded, n = 10))&#10;&#10;   &#10;}&#10;"/>
            <parameter key="use_default_R" value="true"/>
            <parameter key="Rscript_executable" value="Rscript"/>
            <parameter key="use_default_R_LIBS_paths" value="true"/>
            <enumeration key="R_LIBS_paths"/>
          </operator>
          <operator activated="false" class="operator_toolbox:group_into_collection" compatibility="2.14.000" expanded="true" height="82" name="Group Into Collection" width="90" x="514" y="289">
            <parameter key="group_by_attribute" value="topicId"/>
            <parameter key="group_by_attribute (numerical)" value="topicId"/>
            <parameter key="sorting_order" value="numerical"/>
          </operator>
          <connect from_op="Loop Files" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_op="Write files" to_port="collection"/>
          <connect from_op="Write files" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
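As a point of comparison, the per-file steps in the process above (tokenize on non-letters, lowercase, keep tokens of 3 to 30 characters, drop dictionary stopwords, write one output file per input) can be sketched in plain Python. This is a rough sketch under several assumptions, not a fix for Write as Text: it assumes the text has already been extracted to `.txt` files, that the stopword file holds one word per line, and it skips the Snowball stemming step. The function names are hypothetical.

```python
import re
from pathlib import Path


def load_stopwords(path, encoding="utf-8"):
    """Read one stopword per line (assumed layout), lowercased."""
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip().lower()
        if word:
            words.add(word)
    return words


def process_text(text, stopwords, min_chars=3, max_chars=30):
    """Tokenize on non-letters, lowercase, filter by length and stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters, so å, ä and ö are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [
        t for t in tokens
        if min_chars <= len(t) <= max_chars and t not in stopwords
    ]


def process_directory(src_dir, dst_dir, stopwords, encoding="utf-8"):
    """Write the processed tokens of every .txt file in src_dir to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    # Non-recursive on purpose, so same-named files cannot overwrite each other.
    for src in sorted(Path(src_dir).glob("*.txt")):
        tokens = process_text(src.read_text(encoding=encoding), stopwords)
        out = dst / src.name
        out.write_text(" ".join(tokens) + "\n", encoding=encoding)
        written.append(out)
    return written
```

Each output file keeps the name of its source file, so the processed text can be matched back to the original report.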

