Options

ETL standard text modification

Neon67Neon67 Member Posts: 6 Contributor II
edited November 2018 in Help
Hello

What are the options for preparing a dataset within Rapidminer Studio when it comes to string attributes ?.

I know I can split and Combine, but is there a way to remove White space, transform into upper lower or capital case etc, with and without generating a new attribute. Is there a dedicated Operator for this or do I have to use the Script Operator ?

I there an other Operator showing pre build function or any kind of help when comming to scripting, maccros, or regular expressions ?

thanks  ;D

Answers

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,508 RM Data Scientist
    Hi,

    Generate Attribute has quite some text function. Further there is of course the text mining extension.

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    David_ADavid_A Administrator, Moderator, Employee, RMResearcher, Member Posts: 297 RM Research
    Hi,

    Generating Attribute is indeed the best way to do this.
    What you can do if you don't want to create new attributes, is just to use the old attribute name and it will overrides the values.

    For pre-build functions take a look at the tutorial process in the help to the Generate Attributes operator, it shows a lot of examples what you can do with macros and expressions.

    Best,
    David
  • Options
    Neon67Neon67 Member Posts: 6 Contributor II
    Thanks all for the information

    I have a last question though regarding simple and more complex cross table, I saw the example with pivoting operator, and also check de pivot and transpose.
    I am unable to create the same kind form on the left with the input data on the right of this picture

    image


    Any hint how to proceed ?  ???

    Thanks
  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,508 RM Data Scientist
    I think you need to combine aggreagte and pivot in some way to do this. I would need to sit down to get this.

    OWC's stats extension (commerical) includes an operator which should do the trick. See: https://oldworldcomputing.com/products/statistics-extension-for-rapidminer

    Best,\Martin

    Edit: Apparently there is a demo process, see http://www.myexperiment.org/workflows/1488.html
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    Neon67Neon67 Member Posts: 6 Contributor II
    Thanks for the tip
    I reviewed the second link and the problem is that he is doing an aggregation, and in my case I shouldn't have one.
    I am looking for a free and quickest solution, the less operation needed, the better actually.

    If you have any other suggestions please feel free to share  :)
  • Options
    JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Here you go.  The trick is to use depivot with the right regular expression.  - just to confirm, this takes the cross-table from the Excel sheet and converts it into an exampleset as displayed on the left of your picture.

    ^(?!Type$|myID$).*$  -> this performs a negative lookahead for all atrribute names separated by | within the brackets (the $ tells it that the attribute name must end that way so "Type" is excluded from the depivot, but "Type2" is included.  Any other attributes get included in the depivoted operation.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.000">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_excel" compatibility="7.0.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="85">
           <parameter key="excel_file" value="C:\Users\user\Desktop\ExcelSummaryTable.xlsx"/>
           <list key="annotations"/>
           <list key="data_set_meta_data_information"/>
         </operator>
         <operator activated="true" class="transpose" compatibility="7.0.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="136"/>
         <operator activated="true" class="rename_by_example_values" compatibility="7.0.000" expanded="true" height="82" name="Rename by Example Values" width="90" x="179" y="238"/>
         <operator activated="true" class="rename" compatibility="7.0.000" expanded="true" height="82" name="Rename" width="90" x="313" y="238">
           <parameter key="old_name" value="Country"/>
           <parameter key="new_name" value="Type"/>
           <list key="rename_additional_attributes">
             <parameter key="A" value="myID"/>
           </list>
         </operator>
         <operator activated="true" class="de_pivot" compatibility="7.0.000" expanded="true" height="82" name="De-Pivot" width="90" x="380" y="34">
           <list key="attribute_name">
             <parameter key="myValue" value="^(?!Type$|myID$).*$"/>
           </list>
           <parameter key="index_attribute" value="Country"/>
           <parameter key="create_nominal_index" value="true"/>
         </operator>
         <connect from_op="Read Excel" from_port="output" to_op="Transpose" to_port="example set input"/>
         <connect from_op="Transpose" from_port="example set output" to_op="Rename by Example Values" to_port="example set input"/>
         <connect from_op="Rename by Example Values" from_port="example set output" to_op="Rename" to_port="example set input"/>
         <connect from_op="Rename" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
         <connect from_op="De-Pivot" from_port="example set output" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
  • Options
    Neon67Neon67 Member Posts: 6 Contributor II
    Thanks a lot  :)
Sign In or Register to comment.