Options

"Error on joining data"

lmsasulmsasu Member Posts: 20 Contributor II
edited May 2019 in Help
Hello all,

I am trying to join two different excel files, having a common ID column which can be used for joining. However, I keep receiving an error at runtime: "Process failed. duplicate attribute name: 1". The first row of each file contains the column names, the columns have different names (excepting the one used for joining). What's wrong?

Thanks in advance,
Lucian
Tagged:

Answers

  • Options
    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    actually I can't reproduce the problem. See here for my simple example process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.003" expanded="true" name="Process">
        <process expanded="true" height="431" width="550">
          <operator activated="true" class="generate_data" compatibility="5.1.003" expanded="true" height="60" name="Generate Data" width="90" x="45" y="75"/>
          <operator activated="true" class="generate_id" compatibility="5.1.003" expanded="true" height="76" name="Generate ID (2)" width="90" x="246" y="165"/>
          <operator activated="true" class="select_attributes" compatibility="5.1.003" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="att2|id"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.1.003" expanded="true" height="76" name="Generate ID" width="90" x="246" y="300"/>
          <operator activated="true" class="select_attributes" compatibility="5.1.003" expanded="true" height="76" name="Select Attributes (2)" width="90" x="380" y="300">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="att3|id|att1"/>
          </operator>
          <operator activated="true" class="join" compatibility="5.1.003" expanded="true" height="76" name="Join" width="90" x="447" y="120"/>
          <connect from_op="Generate Data" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="original" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Could you please post your process setup, too?

    Greetings,
      Sebastian
  • Options
    lmsasulmsasu Member Posts: 20 Contributor II
    Thanks for reply.

    In my case, the id columns used for joining the two data are present in both files and I don't generate them. The data

    My code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.003" expanded="true" name="Process">
        <process expanded="true" height="639" width="805">
          <operator activated="true" class="read_excel" compatibility="5.1.003" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="D:\Work\lenses_part1.xls"/>
            <parameter key="imported_cell_range" value="a2:d25"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="id.true.real.id"/>
              <parameter key="1" value="col1.true.real.regular"/>
              <parameter key="2" value="col2.true.real.regular"/>
              <parameter key="3" value="col3.true.real.regular"/>
            </list>
          </operator>
          <operator activated="true" class="read_excel" compatibility="5.1.003" expanded="true" height="60" name="Read Excel (2)" width="90" x="45" y="210">
            <parameter key="excel_file" value="D:\Work\lenses_part2.xls"/>
            <parameter key="imported_cell_range" value="A2:d25"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="id.true.numeric.id"/>
              <parameter key="1" value="col4.true.numeric.regular"/>
              <parameter key="2" value="col5.true.numeric.regular"/>
              <parameter key="3" value="col6.true.numeric.regular"/>
            </list>
          </operator>
          <operator activated="true" class="join" compatibility="5.1.003" expanded="true" height="76" name="Join" width="90" x="246" y="120"/>
          <connect from_op="Read Excel" from_port="output" to_op="Join" to_port="left"/>
          <connect from_op="Read Excel (2)" from_port="output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    The data in lenses_partX.xls are obtained from http://archive.ics.uci.edu/ml/machine-learning-databases/lenses/lenses.data, the first 4 columns are retained in lenses_part1.xls, the others are in lenses_part2.xls and the id is manually inserted into the xls file.

    Thanks,
    Lucian
  • Options
    awchisholmawchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Change this
            <parameter key="1" value="col4.true.numeric.regular"/>
              <parameter key="2" value="col5.true.numeric.regular"/>
              <parameter key="3" value="col6.true.numeric.regular"/>

    to this
            <parameter key="a" value="col4.true.numeric.regular"/>
              <parameter key="b" value="col5.true.numeric.regular"/>
              <parameter key="c" value="col6.true.numeric.regular"/>

    I think it's because the attribute names are not set; the import wizard default is to use names lke 1,2,3.



    regards

    Andrew
  • Options
    lmsasulmsasu Member Posts: 20 Contributor II
    Thanks for reply. At design time I got a warning:
    "Error transforming meta data transformation: java.lang.NumberFormatException: For input string: "a" ".

    Lucian
  • Options
    awchisholmawchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    Try the attached. I think it's because the range was A2:d25 which meant it was using row 2 as the names of the attributes so it failed on the initial read of the Excel files (the names of the Excel files will need to be changed for the setup you have).

    regards

    Andrew
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.004">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
       <process expanded="true" height="639" width="805">
         <operator activated="true" class="read_excel" compatibility="5.1.004" expanded="true" height="60" name="Read Excel (4)" width="90" x="45" y="120">
           <parameter key="excel_file" value="D:\work\Book2.xls"/>
           <parameter key="imported_cell_range" value="A1:C25"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="id.true.integer.id"/>
             <parameter key="1" value="4.true.integer.attribute"/>
             <parameter key="2" value="5.true.integer.attribute"/>
           </list>
         </operator>
         <operator activated="true" class="read_excel" compatibility="5.1.004" expanded="true" height="60" name="Read Excel (3)" width="90" x="45" y="30">
           <parameter key="excel_file" value="D:\work\Book1.xls"/>
           <parameter key="imported_cell_range" value="A1:D25"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="id.true.integer.id"/>
             <parameter key="1" value="1.true.integer.attribute"/>
             <parameter key="2" value="2.true.integer.attribute"/>
             <parameter key="3" value="3.true.integer.attribute"/>
           </list>
         </operator>
         <operator activated="true" class="join" compatibility="5.1.004" expanded="true" height="76" name="Join" width="90" x="246" y="75"/>
         <connect from_op="Read Excel (4)" from_port="output" to_op="Join" to_port="right"/>
         <connect from_op="Read Excel (3)" from_port="output" to_op="Join" to_port="left"/>
         <connect from_op="Join" from_port="join" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
  • Options
    lmsasulmsasu Member Posts: 20 Contributor II
    It did not work. However, after defining the names of the columns (I expected that checking "first row as names" brings the names of the columns) the join operation worked fine.

    Thanks for your replies.

    Lucian
Sign In or Register to comment.