AMOUNT OF EXAMPLES DOES NOT CORRELATE WITH INPUT DATA LOADED FROM PDFs

antonio_heredia · July 2018

on="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="136">
        <list key="text_directories">
          <parameter key="Forging vs AM" value="C:\Users\xwb15193\Desktop\L.R AM vs F\ScienceDirect\ScienceDirect_articles_04Jul2018_11-57-34.507"/>
        </list>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
            <parameter key="mode" value="linguistic tokens"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="246" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
          <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="77" y="85">Type your comment</description>
          <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="99" y="169">Type your comment</description>
        </process>
      </operator>
      <operator activated="false" class="filter_examples" compatibility="8.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="289">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="label.contains.and"/>
        </list>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I tried to tokenize PDF articles, but the result contains only 21 examples. Why does this happen? I expected many more. I used "Process Documents from Files" and placed "Tokenize" and "Filter Stopwords (English)" inside it. This works, but not across all the documents. What should I do to fix it?

Cheers,

Antonio

lionelderkrikor · July 2018

Hi @antonio_heredia,

Do you have a lot of files?

Can you share these files so that we can reproduce what you observe?
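
If it helps with the first question, here is a minimal Python sketch (not part of the RapidMiner process) that counts the files in a folder by extension. As far as I know, Process Documents from Files returns one example per document, with the tokens becoming attributes, so the file count is the number to compare against the 21 examples. The function name `count_files_by_extension` and its `recursive` flag are my own naming, and the folder passed in is a placeholder for the directory configured under text_directories.

```python
from pathlib import Path


def count_files_by_extension(folder, recursive=False):
    """Return a dict mapping lowercase file extension to the number of files.

    `recursive` is a hypothetical convenience flag: set it to True to also
    count files in subfolders.
    """
    root = Path(folder)
    paths = root.rglob("*") if recursive else root.iterdir()
    counts = {}
    for path in paths:
        if path.is_file():
            ext = path.suffix.lower()  # treat ".PDF" and ".pdf" as the same
            counts[ext] = counts.get(ext, 0) + 1
    return counts


if __name__ == "__main__":
    # Placeholder: replace "." with the folder used in the process.
    for ext, n in sorted(count_files_by_extension(".").items()):
        print(f"{ext or '(no extension)'}: {n}")
```

If the count for .pdf is 21, the process is probably reading every file and the result is simply one row per article. If it is higher, some files are not being read, and those are the ones worth sharing.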

Regards,

Lionel

NB: The first line of your XML process is broken; however, I was able to repair it.

Altair RapidMiner Community