I want to group different items by their brand

demonlovesong Member Posts: 16 Contributor I
edited December 2018 in Help

I have a bunch of items and I want to group them by brand. The item descriptions in the data I receive seem to concatenate the brand name and the item name, and the brand names vary in length. For example, in the picture, I would like all OREO items grouped together instead of being separated into different groups. Thank you!

Best Answers

  • FBT Member Posts: 106 Unicorn
    Solution Accepted

    Do you have a dictionary containing all possible brand names? If not, I believe your best choice would be to combine the ideas of previous responses (i.e. build some regex logic) to create such a dictionary on which you can then run your grouping. This does require some manual labour and can, depending on the amount of different brand names, take up a lot of time, but based on your input data structure there is just no way to directly make an aggregation on brand name.
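
As a rough illustration of the dictionary idea, here is a minimal Python sketch done outside RapidMiner. The brand list, the sample items, and the function names `find_brand` and `group_by_brand` are all made up for the example. It assumes the brand always sits at the start of the description and is followed by whitespace.

```python
from collections import defaultdict

# Hypothetical brand dictionary; in practice this is the list you build by hand or with regex help.
BRANDS = ["OREO", "VICTORIA'S SECRET", "THE PEGASUS GROUP"]


def find_brand(description, brands=BRANDS):
    """Return the longest known brand the description starts with, or None."""
    text = description.strip().upper()
    # Try longer names first so a multi-word brand wins over a shorter one.
    for brand in sorted(brands, key=len, reverse=True):
        if text.startswith(brand):
            rest = text[len(brand):]
            # Accept only a full-word match: end of string or whitespace next.
            if rest == "" or rest[0].isspace():
                return brand
    return None


def group_by_brand(descriptions, brands=BRANDS):
    """Group descriptions under their brand; unmatched ones go to 'UNKNOWN'."""
    groups = defaultdict(list)
    for description in descriptions:
        groups[find_brand(description, brands) or "UNKNOWN"].append(description)
    return dict(groups)


# Made-up sample descriptions, for illustration only.
items = [
    "OREO Mini chocolate 95G",
    "OREO Chock 137G",
    "VICTORIA'S SECRET Body Mist",
    "Plain crackers",
]
groups = group_by_brand(items)
```

Anything that lands in the "UNKNOWN" group is a candidate for a new dictionary entry, which is where the manual labour comes in.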


Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @demonlovesong

    Use the following:

    • Read your data (using Retrieve, Read Database, or whatever).
    • Use the Generate Copy operator to create a copy of the column you want to manipulate.
    • Use the Replace operator with the following regular expression, and replace by $1

    (\S+)\s+.*

     

    This regular expression means "capture anything that isn't a space (\S+) that comes before one or more spaces (\s+), which in turn come before any characters (.*)". That is why you use $1: it refers to the first (and only) captured group, the string before the \s+ whitespace.
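
To see what this pattern does outside RapidMiner, here is a small Python sketch. It assumes Python's `re` module behaves like the Replace operator for this pattern, with `\1` standing in for `$1`. The function name `first_word` and the sample strings are made up for the example.

```python
import re

# Same pattern as above: first run of non-space characters, whitespace, then the rest.
FIRST_WORD = re.compile(r"(\S+)\s+.*")


def first_word(description):
    """Replace the whole match by group 1, i.e. keep only the first word."""
    return FIRST_WORD.sub(r"\1", description)


single = first_word("OREO Mini chocolate 95G")
multi = first_word("VICTORIA'S SECRET Body Mist")
untouched = first_word("OREO")
```

Note that a multi-word brand is cut down to its first word, and a description with no whitespace is left unchanged because the pattern does not match.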

     

    All the best,

     

    Rodrigo.

  • demonlovesong Member Posts: 16 Contributor I

    Thank you for the solution, it is really helpful. But what if the brand name contains more than one word? I have 1913 items to extract the brand name from, and they are in random order. Is it achievable?

  • kayman Member Posts: 662 Unicorn

    In your example it seems the brand is separated from the other content by a tab (or multiple spaces). Can you confirm that?

    If that's the case, it should be fairly straightforward. In the case of tabs, your regex needs to be adjusted as follows:

     

     

    ^(.*?)\t.*

     

    Or, even easier: install the Operator Toolbox extension and use the 'Create ExampleSet' operator to copy your data and convert it to a dataset. The attached example gives an idea of how to do this.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="85">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Brand&#9;Type&#9;Weight&#10;Kraft&#9;Oreo Chock&#9;137G&#10;OREO&#9;Mini chocolate&#9;95G"/>
    <parameter key="column_separator" value="\t"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
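
For comparison, the same two ideas (splitting tab-separated text into columns, and the tab regex) can be sketched in Python. This is only an illustration under the assumption that the brand really is separated from the rest by a tab. The sample text is the one from the `input_csv_text` parameter above, where `&#9;` is a tab and `&#10;` is a newline; the name `before_first_tab` and the extra test string are made up.

```python
import csv
import io
import re

# Sample text from the process above, with the XML entities decoded.
SAMPLE = "Brand\tType\tWeight\nKraft\tOreo Chock\t137G\nOREO\tMini chocolate\t95G"

# Option 1: parse the text as tab-separated values; the first line is the header.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
brands = [row["Brand"] for row in rows]

# Option 2: the tab regex, which keeps everything before the first tab.
BEFORE_TAB = re.compile(r"^(.*?)\t.*")


def before_first_tab(line):
    """Return the text before the first tab; lines without a tab are unchanged."""
    return BEFORE_TAB.sub(r"\1", line)


multi_word = before_first_tab("THE PEGASUS GROUP\tSome item\t1KG")
```

With a real tab separator, multi-word brands survive intact, which the space-based pattern cannot guarantee.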

     

  • demonlovesong Member Posts: 16 Contributor I

    That is not the case. There are brand names in the form shown in the attached picture, which give us a problem when creating the column. Is there any way to extract them correctly? Thank you!

  • demonlovesong Member Posts: 16 Contributor I

    Alright, I see. Thank you very much.

  • demonlovesong Member Posts: 16 Contributor I

    By the way, if I want to create a dictionary of all the brand names, how do I do it with RapidMiner? Thank you!

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    There are tons of ways to accomplish that task.

    1.- Create a CSV file with Excel and import it.
    2.- Use the Data Editor from RapidMiner to create a new data object.
    3.- Use the "Create ExampleSet" operator from the Operator Toolbox.
    4.- Create a copy of your 200000 products and create something that groups by similarities.

    Before going that way, do you mind analyzing your content with a hexadecimal editor first? Perhaps you can find a pattern that allows you to actually do the split. On Mac, you can use HexFiend, which is super easy to use.

    As for me, the most difficult way to split strings I can think of (and the one I would never choose to explain to others but would probably choose for myself) is to split each string by commas and execute multiple orderings by word1, word1+word2, word1+word2+word3, until I can analyze each one in terms of depth (depth 1 would be OREO, depth 2 would be VICTORIA'S SECRET, depth 3 would be THE PEGASUS GROUP, and so on) and number of products per depth. However, that is time-consuming and I would use Ruby for such a task. Please don't follow this. I'm just being creative and encouraging you to build your own solution, as preparing data doesn't have to be done in RapidMiner if you have other ways to do that.
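
For the curious, the depth idea can be sketched roughly as follows. This is in Python rather than Ruby, it splits on whitespace rather than commas, and the threshold `min_items`, the function names, and the sample items are all assumptions made for the illustration. It is a heuristic, not a reliable method.

```python
from collections import Counter


def prefix_counts(descriptions, max_depth=3):
    """Count how many descriptions share each word prefix of depth 1..max_depth."""
    counts = Counter()
    for description in descriptions:
        words = description.split()
        for depth in range(1, min(max_depth, len(words)) + 1):
            counts[(depth, " ".join(words[:depth]))] += 1
    return counts


def guess_brand(description, counts, max_depth=3, min_items=2):
    """Return the longest prefix shared by at least min_items descriptions, or None."""
    words = description.split()
    best = None
    for depth in range(1, min(max_depth, len(words)) + 1):
        prefix = " ".join(words[:depth])
        if counts[(depth, prefix)] >= min_items:
            best = prefix
    return best


# Made-up sample descriptions, for illustration only.
items = [
    "OREO Mini chocolate 95G",
    "OREO Chock 137G",
    "THE PEGASUS GROUP Widget A",
    "THE PEGASUS GROUP Gadget B",
]
counts = prefix_counts(items)
guesses = [guess_brand(item, counts) for item in items]
```

The weakness is easy to see: if two products of the same brand also share their next word, that word gets absorbed into the guessed brand, so the results still need a manual review.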

    All the best,

    Rodrigo.
  • demonlovesong Member Posts: 16 Contributor I

    I really appreciate your time. Thank you very much and have a nice day!
