Options

Extract Sheet name from an Excel file

gabriela_sanchegabriela_sanche Member Posts: 4 Contributor I
edited December 2018 in Help

Hi everyone,

I want to extract the name of the Excel sheet I'm reading to add as a new attribute.

For example:

Data      Sheet_name

A           my_sheet_name

B           my_sheet_name      

C           my_sheet_name

 

Is there a way to do this?

Thanks

Tagged:

Best Answers

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi @gabriela_sanche

     

    There is a way to perform your task with the Execute Python operator (to download and install from marketplace),

    but you have to install Python language on your computer.

    The new attribute with the sheet name appears on the last column (new created column) : 

    sheet_name_as_attribute.png

     

    You have to modify two parameters in the Python code according to the sheets that you want study, but nothing complicated, all is commented.

    Here the process : 

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="112" y="34">
    <parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;&#10; path = 'C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer'# path where the xls file is stored&#10; file_name = 'sheets_name_as_attribute.xlsx' # name &#10; &#10; #Read the sheet number 'x' of the Excel file &#10; &#10; data = pd.read_excel(path + '/' + file_name,sheetname = 0) #modify the value of parameter 'sheetname' &#10; #WARNING : for the &quot;sheetname' parameter, the first sheet is 0, the second 1, the third 2 etc .&#10;&#10; # get the sheets name of the excel file&#10; xls = pd.ExcelFile(path + '/' + file_name , on_demand = True)&#10; sheets = xls.sheet_names&#10;&#10; &#10; data['Sheet_name'] = str(sheets[0])# modify [0] according to the value of parameter 'sheetname' above&#10; &#10;&#10; # connect 2 output ports to see the results&#10; return data"/>
    </operator>
    <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    In attached file, my fictive .xls file.

    This is not the easier way, but maybe it will be helpful

     

    Regards,

     

    Lionel

     

     

  • Options
    sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted

    ok I hate it when people cop out and use python :)  RapidMiner all the way.  Here you go:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
    <parameter key="excel_file" value="/Users/genzerconsulting/Desktop/FOOO/FOOO.xlsx"/>
    <parameter key="imported_cell_range" value="A1:C263"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="foo.true.polynominal.attribute"/>
    <parameter key="1" value="foo2.true.polynominal.attribute"/>
    <parameter key="2" value="C.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="124" name="Multiply" width="90" x="179" y="136"/>
    <operator activated="true" class="operator_toolbox:get_parameters" compatibility="0.7.000" expanded="true" height="103" name="Get Parameters" width="90" x="313" y="34">
    <parameter key="Operator name" value="Read Excel"/>
    </operator>
    <operator activated="true" class="converters:parameter_set_2_example_set" compatibility="0.3.001" expanded="true" height="103" name="Parameter Set to ExampleSet" width="90" x="447" y="34"/>
    <operator activated="true" class="extract_macro" compatibility="8.0.001" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="34">
    <parameter key="macro" value="sheetName"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="Read Excel.sheet_number"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="annotations_to_data" compatibility="8.0.001" expanded="true" height="82" name="Annotations to Data" width="90" x="313" y="238"/>
    <operator activated="true" class="extract_macro" compatibility="8.0.001" expanded="true" height="68" name="Extract Macro (2)" width="90" x="447" y="238">
    <parameter key="macro" value="file_path"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="value"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="loop_zipfile_entries" compatibility="8.0.001" expanded="true" height="82" name="Loop Zip-File Entries" width="90" x="581" y="238">
    <parameter key="filename" value="%{file_path}"/>
    <parameter key="filter" value="workbook.xml"/>
    <parameter key="recursive" value="true"/>
    <process expanded="true">
    <operator activated="true" class="advanced_file_connectors:read_xml" compatibility="8.0.001" expanded="true" height="68" name="Read XML" width="90" x="179" y="34">
    <parameter key="file" value="/Users/genzerconsulting/Desktop/FOOO/xl/workbook.xml"/>
    <parameter key="xpath_for_examples" value="//default:workbook/default:sheets"/>
    <enumeration key="xpaths_for_attributes">
    <parameter key="xpath_for_attribute" value="default:sheets[1]/text()"/>
    <parameter key="xpath_for_attribute" value="default:sheet[1]/attribute::name"/>
    <parameter key="xpath_for_attribute" value="default:sheet[1]/attribute::r:id"/>
    <parameter key="xpath_for_attribute" value="default:sheet[1]/attribute::sheetId"/>
    <parameter key="xpath_for_attribute" value="default:sheet[1]/text()"/>
    <parameter key="xpath_for_attribute" value="default:sheet[2]/attribute::name"/>
    <parameter key="xpath_for_attribute" value="default:sheet[2]/attribute::r:id"/>
    <parameter key="xpath_for_attribute" value="default:sheet[2]/attribute::sheetId"/>
    <parameter key="xpath_for_attribute" value="default:sheet[2]/text()"/>
    </enumeration>
    <list key="namespaces">
    <parameter key="mx" value="http://schemas.microsoft.com/office/mac/excel/2008/main"/>
    <parameter key="x14" value="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"/>
    <parameter key="x15ac" value="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac"/>
    <parameter key="x15" value="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"/>
    <parameter key="mc" value="http://schemas.openxmlformats.org/markup-compatibility/2006"/>
    <parameter key="r" value="http://schemas.openxmlformats.org/officeDocument/2006/relationships"/>
    <parameter key="default" value="http://schemas.openxmlformats.org/spreadsheetml/2006/main"/>
    </list>
    <parameter key="default_namespace" value="http://schemas.openxmlformats.org/spreadsheetml/2006/main"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="default:sheets[1]/text().true.attribute_value.attribute"/>
    <parameter key="1" value="default:sheet[1]/attribute::name.true.attribute_value.attribute"/>
    <parameter key="2" value="default:sheet[1]/attribute::r:id.true.attribute_value.attribute"/>
    <parameter key="3" value="default:sheet[1]/attribute::sheetId.true.attribute_value.attribute"/>
    <parameter key="4" value="default:sheet[1]/text().true.attribute_value.attribute"/>
    <parameter key="5" value="default:sheet[2]/attribute::name.true.attribute_value.attribute"/>
    <parameter key="6" value="default:sheet[2]/attribute::r:id.true.attribute_value.attribute"/>
    <parameter key="7" value="default:sheet[2]/attribute::sheetId.true.attribute_value.attribute"/>
    <parameter key="8" value="default:sheet[2]/text().true.attribute_value.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="de_pivot" compatibility="8.0.001" expanded="true" height="82" name="De-Pivot" width="90" x="380" y="34">
    <list key="attribute_name">
    <parameter key="Sheet Name" value=".*name"/>
    <parameter key="Sheet Number" value=".*sheetId"/>
    </list>
    <parameter key="index_attribute" value="foo"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Sheet Name|Sheet Number"/>
    </operator>
    <connect from_port="file object" to_op="Read XML" to_port="file"/>
    <connect from_op="Read XML" from_port="output" to_op="De-Pivot" to_port="example set input"/>
    <connect from_op="De-Pivot" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="8.0.001" expanded="true" height="82" name="Append" width="90" x="715" y="238"/>
    <operator activated="true" class="generate_attributes" compatibility="8.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="849" y="136">
    <list key="function_descriptions">
    <parameter key="Sheet_name" value="%{sheetName}"/>
    </list>
    </operator>
    <operator activated="true" class="join" compatibility="8.0.001" expanded="true" height="82" name="Join" width="90" x="983" y="187">
    <parameter key="join_type" value="left"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="Sheet_name" value="Sheet Number"/>
    </list>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Get Parameters" to_port="through 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 3" to_op="Annotations to Data" to_port="object"/>
    <connect from_op="Get Parameters" from_port="parameters" to_op="Parameter Set to ExampleSet" to_port="parameters"/>
    <connect from_op="Parameter Set to ExampleSet" from_port="exampleSet" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Annotations to Data" from_port="annotations" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Loop Zip-File Entries" from_port="out 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Join" to_port="right"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Join" to_port="left"/>
    <connect from_op="Join" from_port="join" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott

     

Answers

  • Options
    sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @gabriela_sanche - absolutely if you use the "Operator Toolbox" extension.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
    <parameter key="excel_file" value="/Users/genzerconsulting/Desktop/move to RM computer/dan.xlsx"/>
    <parameter key="imported_cell_range" value="A1:C263"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="foo.true.polynominal.attribute"/>
    <parameter key="1" value="foo2.true.polynominal.attribute"/>
    <parameter key="2" value="C.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
    <operator activated="true" class="operator_toolbox:get_parameters" compatibility="0.7.000" expanded="true" height="103" name="Get Parameters" width="90" x="380" y="34">
    <parameter key="Operator name" value="Read Excel"/>
    </operator>
    <operator activated="true" class="converters:parameter_set_2_example_set" compatibility="0.3.001" expanded="true" height="103" name="Parameter Set to ExampleSet" width="90" x="514" y="34"/>
    <operator activated="true" class="extract_macro" compatibility="8.0.001" expanded="true" height="68" name="Extract Macro" width="90" x="648" y="34">
    <parameter key="macro" value="sheetName"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="Read Excel.sheet_number"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="849" y="136">
    <list key="function_descriptions">
    <parameter key="Sheet_name" value="%{sheetName}"/>
    </list>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Get Parameters" to_port="through 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Get Parameters" from_port="parameters" to_op="Parameter Set to ExampleSet" to_port="parameters"/>
    <connect from_op="Parameter Set to ExampleSet" from_port="exampleSet" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott

  • Options
    gabriela_sanchegabriela_sanche Member Posts: 4 Contributor I

    Thank you @sgenzer for your answer. Great extension, I haven't used it before.

    However I wish to extract the sheet NAME not the number.

    For example, for the Excel in the image I wish to extract "Name 10".

     

    Screen Shot 2017-12-20 at 4.10.27 PM.png

     

    Is there a way to do this?

  • Options
    gabriela_sanchegabriela_sanche Member Posts: 4 Contributor I

    @lionelderkrikor Thanks!

    Now I know there's no direct (RM operator) to do it, but it's exactly the result I was looking for.

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @sgenzer

     

    you have to be indulgent, i'm still in a learning phase......:smileyhappy:

    OK, I recorded this method.

     

    Best regards,

     

    Lionel

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    @lionelderkrikor,

     

    no worries. We are all learning. I am using the product for 7 years and @sgenzer just suprised me with this solution.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi,

     

    reading the .xlsx file as Zip and processing XML is a brilliant idea, but quite involved and it won't work with old .xls files. 

     

    This should really be a setting in Read Excel (boolean: Return sheet name, text: Sheet name macro). Then you could even process it in loops that read all the sheets in a file and would get full metadata.

     

    Regards,

    Balázs

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi,

     

    @BalazsBarany i was thinking at some point to write a "Sheetname to ExampleSet" operator. It's easy to write. Input: either a file object or a path to a file. Output: Exampleset with sheetId and sheetname.

     

    Would that do the trick for you? If yes - i would write it just for you :) I still owe you quite some favours. E.g. for the generic join script.

     

    Cheers,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi Martin,

     

    I'm not the original poster who requested the feature. I just described the idea for extending the operator.

     

    Your solutions sounds interesting. Let's think about implications for the user and for the performance.

     

    I think users would expect this functionality directly in Read Excel, not as a separate operator. If it's a special operator, you would need to join its results afterwards with the example set from the worksheet, or filter the example set and use Extract Macro to get the name of the one worksheet you're interested in. It's of course better if the user wants all worksheet numbers and names. 

     

    Performance-wise, I'm sure that reading worksheet names is a part of the backend library that already handles both .xls and .xlsx, and when the file is already open, the performance impact shouldn't be big. That would favor a solution integrated in Read Excel.

     

    Another possible implementation would be a Loop Worksheets operator. It would be similar to Loop Zip-file entries. (And nobody would search for it under Process Control/Loops ;-))

     

    Regards,

    Balázs

  • Options
    sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    I really like "loop worksheets" - that makes the most sense to me.

     

    @lionelderkrikor - sorry didn't mean to put down your skills.  Python is perfectly legitimate and what most people would do.  I just take it as a personally challenge to do things purely in RapidMiner.  :)

     

    Scott

     

     

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    @BalazsBarany

    i am not sure if i agree with you. The usual way to use "a table with sheetnames" would be a loop over them and read some of them (or all) with distinct Read Excel operators. Using a Read Excel to get this list seems odd in this case, right? So having a "Read Excel Sheet Names"  sounds more reasonable to me.

     

    But maybe our users can add a few thoughts. @gabriela_sanche @lionelderkrikor what would you prefer?

     

    Side note: I can write a new operator but not extend the read excel. But i would of course file a feature request if read excel is the best way to do it.

     

    Cheers,

    martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    gabriela_sanchegabriela_sanche Member Posts: 4 Contributor I

    @mschmitz Indeed, I wish to read the sheet name to have control over multiple excel files (inside a Loop Files Operator and then a normal Loop to read the different sheets).

    However, I often encounter important information as the sheet name, so having it as metadata directly from the Read Excel would be really helpful.

     

    Thank you all for your interest in this problem :)

     

     

     

     

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    @sgenzer,

     

    No problem : I understood of course that there was a personnal challenge and it was useful for me : It's a way for me to learn about the RM operators and to improve my skills on RapidMiner.

    So for next topics, if I persist to use Python, don't hesitate to post a 100% RM process ........:smileyvery-happy:

     

    Regards,

     

    Lionel

     

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    @mschmitz,

     

    On the substance of the subject, i have no preference between the suggested solutions BUT 

    for me, for a better user experience and more efficiency , a dedicated operator like "Read Excel Sheet Names" is more adapted : 

    In deed, based on my own experience, when I try to find an operator to perform a function,

    it's easier to find and use an operator with an explicit name, with a simply input , a

    simply output (in particular for non experiment users), few parameters. It's more difficult to discover that the function

    can be performed by searching in the large numbers of parameters of an (existing) operator (which has sometimes no explicit name for the searched function).

     

    I hope that it's understandable and it will help to improve the features of RapidMiner.

     

    Best regards,

     

    Lionel

     

     

     

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist

    Hi @lionelderkrikor,

    the point here is that there is some friction between having more explicit operators for one task and having too many operators. To learn rapidminer you need to learn some kind of "dictionary" of operators and their purpose. We did some internal studies on how many operators you need to be good in RM and having more is not always better. I can see you argument for this but it's a two-sided one.

     

    @gabriela_sanche thanks for your feedback! I will have a look if i can write this operator.

     

    Cheers,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
Sign In or Register to comment.