Vlookup in Rapid Miner?

kyrakyra Member Posts: 5 Contributor I
edited December 2018 in Help

Hello!

 

I'm quite new to Rapid Miner and was wondering if there's a VLOOKUP equivalent in it?

I work with two sets of data - the first one being sales file where our supplier lets us know which products have sold, how much etc. The second file is a reference file, which I use to check all product names with codes (ie: Product 1 = GBCIF281939499). 

 

Is there a way to do this in Rapid Miner?

 

Kyra

Tagged:
Gonfiaf_Zuraik

Answers

  • Gonfiaf_ZuraikGonfiaf_Zuraik Member Posts: 8 Contributor I

    hello @kyra

     

    if you choose the "Turbo Prep" in the Views, you will see that one of the actions there is called PIVOT, if you click it, it will give you the opportunity to group the data by the columns you want, and it also gives the possibility to aggregate. try it might be it will help you with your concern :)

    as I am also new to RapidMiner, I would be so pleased if this answer could help :smileyhappy:

     

    Cheers and Good luck

    Jana

     

     

    sgenzermschmitzrfuentealba
  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 238   Unicorn

    Hi @kyra,

     

    the concept of Excel's VLOOKUP is called a "join" everywhere else.

    Use the Join operator with the two input example sets, and set the common attribute (the product code).

    Decide if you only want to keep matching rows (inner join) or also keep entries without matching product codes (left or right join).

     

    Regards,

    Balázs

    sgenzerGonfiaf_Zuraikrfuentealba
  • kyrakyra Member Posts: 5 Contributor I

    Thank you, that was super helpful!

    Is there a way to create a process where Rapid Miner creates a new column with the matching product codes?

    For example - after cross referencing both files, a new column is created in the output file called Final Product Code (for example), where it shows you a list of all the Product Codes it was able to match, and highlights the ones that weren't matching or missing?

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 238   Unicorn

    Hi @kyra!

     

    The Join operator has a setting for keeping both key attributes. (You might have to activate the Expert mode in the Parameters panel.)

    To create a new attribute, you use Generate Attributes. It has a graphical formula editor which lists the functions you'll need (if(), missing() etc.).

     

    Regards,

    Balázs

    sgenzerrfuentealba
  • kyrakyra Member Posts: 5 Contributor I
    Hi Balazs,

    Thanks for your response - perhaps I need to provide more info as I'm still having a bit of trouble creating the workflow i want:

    I work at a record label and have two sets of data that I normally work with:
    1. Statement 
    2. VLOOKUP

    When I receive a sales statement, it breaks down the our sales from that Customer for the month, line by line. 
    For example, we’ll receive a monthly sales statement from (ie) Napster that has a line by line breakdown of what digital albums and tracks were sold for the month of October
    The information they provide are things like the ISRC (track code), sale amount (per track), etc.

    My job is to a) do a VLOOKUP to make sure that the ISRC’s provided are correct and b) extract the data that I need and put it into a separate excel sheet. I put it into a format that is readable by our accounting program.

    We have a lot of Customers and each Customer delivers a different sales statement in terms of format so much of my time is spent dissecting it. What I’d like to do with Rapid Miner is automate that process where I can input a sales statement file & my VLOOKUP file and output the data in the format that is readable by our accounting program.

    The data I would need from a statement would be:
    ISRC
    Amount
    Region

    The problem I’m facing is that the statement’s columns are not necessarily named ‘ISRC’, ‘Amount’ and ‘Region’, but this is what our accounting system reads. So the output file on Rapid Miner needs to have these headers but somehow extract the information from the sales statement that has different headers.


    Ultimately, my question is what are the steps I need to take to achieve this on Rapid Miner?

    I’ve put in a Sales Statement and my VLOOKUP sheet onto Rapid Miner.
    I’ve tried the JOIN operator but not sure what settings they need to be on and not sure I’m missing anything else…(pretty sure I am!) 

    It is quite complicated and I’m happy to jump on a call to further explain as there’s a lot to it…

    Thank you in advance!

  • Knut-RMKnut-RM Administrator, Employee, Member, University Professor Posts: 126  Administrator
  • kyrakyra Member Posts: 5 Contributor I
    @Knut-RM Thanks for that - will check that out.
    Am I right in thinking that Turbo Prep is not available in the free version? Is that a paid add on?
  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,444  RM Founder
    edited November 2018
    Hi @kyra,
    Am I right in thinking that Turbo Prep is not available in the free version? Is that a paid add on?
    Yes, that is correct.  Turbo Prep is part of the first 30 days to all users and afterwards it is part of RapidMiner's commercial offering.  See this link here for more information:
    Best,
    Ingo
    mschmitzsgenzer
  • M_MartinM_Martin RapidMiner Certified Analyst, Member Posts: 110   Unicorn
    Hi @kyra:  Putting whether or not Turbo Prep will be available to you aside for a moment, it sounds like you also have a Data Consistency (metadata) problem as it sounds like field names of for similar data items do not always align between systems your company uses and depends on, so matter what tool you use, there will need to be some logic (which can difficult to maintain over time) in place to resolve these inconsistencies. 
    How about setting up a "Master Mapping Table" that maps all of the field names in the data that should map back to each of your core values, like Region, ISRC, Amount, etc?   would imagine that there are other line of business applications in your company that could also make use of such a Mapping Table - so it could be worth starting a separate project to resolve these inconsistencies.  Data Warehouses couldn't function without these types of tables, which are also known as Dimension Tables.
    Once the Mapping Table has been put together, it seems to me that you could do what you want to do with the RapidMiner JOIN operator - with the join type being INNER. 
    Key point: the Mapping Table has to be maintained - because if another "business synonym" for one of your "core fields" enters the data, you'll need to map this new "synonym" to the appropriate "core" field name.  This is like what you need to do if you're tracking retail product sales - as new products come to market, you need to segment them into categories and sub-categorises so that aggregates by categories and sub-categories capture the sales of these new products.  Hope this helps, and please write back if you have any questions.  Best wishes, Michael Martin
Sign In or Register to comment.