RapidMiner


WordList to Data gives me example set with no attribute names

Status: Investigating

I'm running Process Documents to get a word list which I then convert to data using WordList to Data. All goes well until I try to select, filter or otherwise use the dataset thus created. I cannot see any attribute names in the data. I can manually type them in (e.g. in Select Attributes, but not all operators allow this), but subsequent operators in the chain do not see the names.

I have tried Synchronize Metadata and Materialize Data, but neither of these helps.

In the example, the final Select Attributes operator does not allow me to select attributes from a list, although I can type "word" into the box when the filter type is "single".

Any ideas?

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
        <list key="attribute_values">
          <parameter key="text" value="&quot;hello world out there&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="7.5.000" expanded="true" height="82" name="WordList to Data" width="90" x="581" y="34"/>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="715" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Comments
RM Certified Expert

Yes, this is an annoying RapidMiner bug. It's purely a metadata propagation problem. If the "Synchronize Meta Data with Real Data" option didn't solve it, you can still add the attributes you want manually when selecting the "subset" option. Just type each attribute name into the box on the right-hand side and click the green plus button (see screenshot). The names won't autopopulate, but they will actually work when you run the process.

select attributes.PNG
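For anyone who can't see the screenshot, the workaround amounts to changing the Select Attributes parameters like this. This is a sketch, not a tested process; the "word" attribute name comes from the thread above, and any other names you type (RapidMiner separates subset entries with "|") are whatever your wordlist actually produces:

```xml
<operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" name="Select Attributes">
  <!-- switch from "all" to "subset" and type the names manually -->
  <parameter key="attribute_filter_type" value="subset"/>
  <!-- names won't appear in the chooser, but they resolve at runtime -->
  <parameter key="attributes" value="word"/>
</operator>
```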

Contributor I

Thanks - glad it's not just me!

Given that it's a bug, I've got a workaround which is to generate new attributes (equal to the existing attributes) after the dataset conversion. I can then select the new attributes as per normal and throw away the old, hidden attributes. A bit of a faff, but as the issue propagates all through the workflow, it's easier this way.

RM Certified Expert

@mizunooto I'd love to see the process you are using to do that. I assume you are doing it in some kind of automated fashion and not just typing in all the new attribute names manually?

 

Contributor I

Imagination is a wonderful thing... at present I'm only using a couple of attributes so a manual operation is fine.

e.g. newword = word

Automation sounds like a great idea but I'll need to learn more to make it work.
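For reference, the manual version of that workaround would look roughly like the fragment below: a Generate Attributes operator placed after WordList to Data, copying each hidden attribute into a new one that the metadata can see. This is a hedged sketch, not the poster's actual process; "newword" is the illustrative name from the comment above, and the keep_all setting is an assumption about how they discard the old attributes:

```xml
<operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" name="Generate Attributes">
  <list key="function_descriptions">
    <!-- copy the metadata-invisible attribute into a visible one -->
    <parameter key="newword" value="word"/>
  </list>
  <!-- drop the original, hidden attributes so only the new ones remain -->
  <parameter key="keep_all" value="false"/>
</operator>
```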

Community Manager

Yes, that's a metadata propagation error for sure. Pushing this thread to Product Feedback.


Scott

 

Community Manager
Status: Investigating

Metadata problem recognized and reported to dev team. Please watch this board for updates.


SG

 

 

Community Manager

Just to confirm with people watching this issue: toggling the "Synchronize Metadata with Real Data" feature does not resolve this, correct?

RM Certified Expert

Correct, the "Synchronize Metadata" option doesn't fix this particular problem. It doesn't correct metadata 100% of the time; I have found similar behavior in a few other cases as well.