[SEMI-SOLVED] Reading CSV file of unknown structure into purely nominal/text

tennenrishin Member Posts: 177 Contributor II
edited October 2019 in Help
What is the easiest way to read a CSV file that has an unknown set (and number) of attributes (named in the first row), into an exampleset where each value is read simply as a nominal (or text) attribute?

My attempt,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
       <parameter key="csv_file" value="/blahblahblah/VTX.csv"/>
       <parameter key="column_separators" value=","/>
       <parameter key="parse_numbers" value="false"/>
       <list key="annotations"/>
       <list key="data_set_meta_data_information"/>
     </operator>
     <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
This still parses numeric-looking data as numeric attributes, even though parse_numbers is set to false.

Failing that, what is the easiest way to do it if the number of attributes is known (but not the names)?

My attempt:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
       <parameter key="csv_file" value="/blahblahblah/VTX.csv"/>
       <parameter key="column_separators" value=","/>
       <parameter key="parse_numbers" value="false"/>
       <list key="annotations"/>
       <list key="data_set_meta_data_information">
         <parameter key="0" value=".true.nominal.regular"/>
         <parameter key="1" value=".true.nominal.regular"/>
         <parameter key="2" value=".true.nominal.regular"/>
         <parameter key="3" value=".true.nominal.regular"/>
         <parameter key="4" value=".true.nominal.regular"/>
         <parameter key="5" value=".true.nominal.regular"/>
       </list>
     </operator>
     <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
This only reads the last attribute and discards the rest.

Answers

  • tennenrishin Member Posts: 177 Contributor II
    Forgot to say please  ;D
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering
    Hi,

    if you just use the CSV operator as in your first example, you can simply follow it up with a "Numerical to Polynominal" operator, set to include all attributes. Or if you like, you can even follow that one up with a "Nominal to Text" operator. After that, all your attributes are of the type 'Text'.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
           <parameter key="csv_file" value="/blahblahblah/VTX.csv"/>
           <parameter key="column_separators" value=","/>
           <parameter key="parse_numbers" value="false"/>
           <list key="annotations"/>
           <list key="data_set_meta_data_information"/>
         </operator>
         <operator activated="true" class="numerical_to_polynominal" compatibility="5.3.013" expanded="true" height="76" name="Numerical to Polynominal" width="90" x="246" y="30"/>
         <operator activated="true" class="nominal_to_text" compatibility="5.3.013" expanded="true" height="76" name="Nominal to Text" width="90" x="380" y="30"/>
         <connect from_op="Read CSV" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
         <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
         <connect from_op="Nominal to Text" from_port="example set output" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Regards,
    Marco
  • tennenrishin Member Posts: 177 Contributor II
    Thanks Marco,

    but then "00005" ends up as "5", for example. I need plain text original attributes, and I don't know their names at design time. This seems like a very basic requirement, or am I missing something obvious?

    Regards,
    Isak
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering
    Hi,

    unfortunately I don't think there is an out-of-the-box way at the moment. I've modified your second process so that it at least does what you want (the metadata entries now use the role "attribute" instead of "regular"):

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="5.3.013" expanded="true" height="60" name="Read CSV" width="90" x="112" y="30">
            <parameter key="csv_file" value="/blahblahblah/VTX.csv"/>
            <parameter key="column_separators" value=","/>
            <parameter key="parse_numbers" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value=".true.nominal.attribute"/>
              <parameter key="1" value=".true.nominal.attribute"/>
              <parameter key="2" value=".true.nominal.attribute"/>
              <parameter key="3" value=".true.nominal.attribute"/>
              <parameter key="4" value=".true.nominal.attribute"/>
              <parameter key="5" value=".true.nominal.attribute"/>
            </list>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Regards,
    Marco
  • tennenrishin Member Posts: 177 Contributor II
    Thanks!
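
The modified process above still hard-codes six columns, so it does not cover a file with an unknown number of attributes. As a rough sketch (this is a helper script, not a RapidMiner feature), one could count the fields in the header row and generate the matching data_set_meta_data_information entries to paste into the process XML. The function names count_columns and nominal_meta_data_list and the sample data are hypothetical. The value string ".true.nominal.attribute" is copied verbatim from the process above; that it behaves the same way for other files or RapidMiner versions is an assumption.

```python
import csv
import io


def count_columns(csv_text: str, separator: str = ",") -> int:
    """Return the number of fields in the first (header) row of the CSV text."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=separator)
    header = next(reader, [])  # empty input yields zero columns
    return len(header)


def nominal_meta_data_list(n_columns: int, indent: str = "  ") -> str:
    """Build one <parameter> entry per column, using the value from the thread."""
    lines = ['<list key="data_set_meta_data_information">']
    for index in range(n_columns):
        # Value string copied verbatim from the modified process in the thread.
        lines.append(
            f'{indent}<parameter key="{index}" value=".true.nominal.attribute"/>'
        )
    lines.append("</list>")
    return "\n".join(lines)


# Hypothetical sample: three columns, two of them with leading zeros.
sample = "id,zip,amount\n00005,01234,12.50\n"
fragment = nominal_meta_data_list(count_columns(sample))
print(fragment)
```

The printed fragment would replace the `<list key="data_set_meta_data_information">` element of the Read CSV operator. Only the header row is inspected, so the data rows themselves are never parsed or converted by this script.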