Vlookup in Rapid Miner?

kyra · October 2018

Hello!

I'm quite new to Rapid Miner and was wondering if there's a VLOOKUP equivalent in it?

I work with two sets of data. The first is a sales file in which our supplier tells us which products have sold, how many, etc. The second is a reference file, which I use to check all product names against their codes (e.g. Product 1 = GBCIF281939499).

Is there a way to do this in Rapid Miner?

Kyra

Gonfiaf_Zuraik · October 2018

hello @kyra,

If you choose "Turbo Prep" in the Views, you will see that one of the actions there is called PIVOT. If you click it, you can group the data by the columns you want, and you can also aggregate. Try it; it might help with your problem.

As I am also new to RapidMiner, I would be very pleased if this answer helps :smileyhappy:

Cheers and Good luck

Jana

BalazsBarany · October 2018

Hi @kyra,

the concept of Excel's VLOOKUP is called a "join" everywhere else.

Use the Join operator with the two input example sets, and set the common attribute (the product code).

Decide if you only want to keep matching rows (inner join) or also keep entries without matching product codes (left or right join).
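For readers who think in code, the difference between the join types can be sketched in plain Python. This is only an illustration of the concept, not how RapidMiner implements its Join operator; the `join` helper, the column names and the unmatched code `XX0000000000` are made up for the example.

```python
# Hypothetical sample data; the column names are assumptions for illustration.
sales = [
    {"product_code": "GBCIF281939499", "units": 3},
    {"product_code": "XX0000000000", "units": 1},  # not in the reference file
]
reference = [
    {"product_code": "GBCIF281939499", "product_name": "Product 1"},
]


def join(left, right, key, how="inner"):
    """Join two lists of row dicts on `key`; `how` is "inner" or "left"."""
    # Assumes each key value appears at most once in `right`.
    lookup = {row[key]: row for row in right}
    right_columns = {col for row in right for col in row if col != key}
    result = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            # Matching row: combine the columns from both sides.
            result.append({**row, **match})
        elif how == "left":
            # Left join: keep the unmatched row, fill right-hand columns with None.
            result.append({**row, **{col: None for col in right_columns}})
    return result


inner_rows = join(sales, reference, "product_code", how="inner")
left_rows = join(sales, reference, "product_code", how="left")
```

Here `inner_rows` keeps only the sale whose code exists in the reference file, while `left_rows` keeps both sales and leaves `product_name` empty for the one without a match.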

Regards,

Balázs

kyra · October 2018

Thank you, that was super helpful!

Is there a way to create a process where Rapid Miner creates a new column with the matching product codes?

For example, after cross-referencing both files, a new column called Final Product Code (say) is created in the output file. It shows all the product codes that could be matched and highlights the ones that did not match or are missing.

BalazsBarany · October 2018

Hi @kyra!

The Join operator has a setting for keeping both key attributes. (You might have to activate the Expert mode in the Parameters panel.)

To create a new attribute, you use Generate Attributes. It has a graphical formula editor which lists the functions you'll need (if(), missing() etc.).
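The logic behind such a generated attribute might look like the following Python sketch. It assumes the rows come from a left join that kept both key attributes; the attribute names `statement_code` and `reference_code`, the `matched` flag and the `NOT FOUND` marker are hypothetical, and this is not RapidMiner expression syntax.

```python
# Rows as they might look after a left join that kept both key attributes.
# "statement_code" and "reference_code" are hypothetical attribute names.
joined = [
    {"statement_code": "GBCIF281939499", "reference_code": "GBCIF281939499"},
    {"statement_code": "XX0000000000", "reference_code": None},  # no match found
]


def add_final_code(rows):
    """Add a 'Final Product Code' column plus a flag marking unmatched rows."""
    out = []
    for row in rows:
        # A missing reference code means the lookup found nothing for this row.
        matched = row["reference_code"] is not None
        out.append({
            **row,
            "Final Product Code": row["reference_code"] if matched else "NOT FOUND",
            "matched": matched,
        })
    return out


flagged = add_final_code(joined)
# Codes from the statement that could not be matched against the reference file.
unmatched = [row["statement_code"] for row in flagged if not row["matched"]]
```

The `if matched else` branch plays the role that if() and missing() would play in the formula editor.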

Regards,

Balázs

kyra · November 2018

Hi Balazs,

Thanks for your response. Perhaps I need to provide more info, as I'm still having a bit of trouble creating the workflow I want:

I work at a record label and have two sets of data that I normally work with:

Statement
VLOOKUP

When I receive a sales statement, it breaks down our sales from that Customer for the month, line by line.

For example, we’ll receive a monthly sales statement from (e.g.) Napster that has a line-by-line breakdown of which digital albums and tracks were sold in October.

The information they provide includes things like the ISRC (track code), sale amount (per track), etc.

My job is to a) do a VLOOKUP to make sure that the ISRCs provided are correct and b) extract the data that I need and put it into a separate Excel sheet, in a format that is readable by our accounting program.

We have a lot of Customers and each Customer delivers a different sales statement in terms of format so much of my time is spent dissecting it. What I’d like to do with Rapid Miner is automate that process where I can input a sales statement file & my VLOOKUP file and output the data in the format that is readable by our accounting program.

The data I would need from a statement would be:

ISRC

Amount

Region

The problem I’m facing is that the statement’s columns are not necessarily named ‘ISRC’, ‘Amount’ and ‘Region’, but this is what our accounting system reads. So the output file on Rapid Miner needs to have these headers but somehow extract the information from the sales statement that has different headers.

—

Ultimately, my question is what are the steps I need to take to achieve this on Rapid Miner?

I’ve loaded a sales statement and my VLOOKUP sheet into Rapid Miner.

I’ve tried the JOIN operator, but I’m not sure what its settings need to be, or whether I’m missing anything else… (pretty sure I am!)

It is quite complicated and I’m happy to jump on a call to further explain as there’s a lot to it…

Thank you in advance!

Knut-RM · November 2018

@kyra this might help. See the second part. https://community.rapidminer.com/discussion/52340/merging-data-sets-with-rapidminer-turbo-prep#latest

kyra · November 2018

@Knut-RM Thanks for that - will check that out.
Am I right in thinking that Turbo Prep is not available in the free version? Is that a paid add on?

IngoRM · November 2018

Hi @kyra,

"Am I right in thinking that Turbo Prep is not available in the free version? Is that a paid add on?"

Yes, that is correct. Turbo Prep is available to all users for the first 30 days; afterwards it is part of RapidMiner's commercial offering. See this link for more information:

https://rapidminer.com/pricing/

Best,

Ingo

M_Martin · November 2018

Hi @kyra: Setting aside for a moment whether Turbo Prep will be available to you, it sounds like you also have a Data Consistency (metadata) problem: field names for similar data items do not always align between the systems your company uses and depends on. So no matter what tool you use, there will need to be some logic in place to resolve these inconsistencies (and that logic can be difficult to maintain over time).
How about setting up a "Master Mapping Table" that maps every field name in the data to the core value it represents, such as Region, ISRC, Amount, etc.? I would imagine that there are other line-of-business applications in your company that could also make use of such a Mapping Table, so it could be worth starting a separate project to resolve these inconsistencies. Data Warehouses couldn't function without these types of tables, which are also known as Dimension Tables.
Once the Mapping Table has been put together, it seems to me that you could do what you want to do with the RapidMiner JOIN operator - with the join type being INNER.
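As a rough sketch of the mapping-table idea, the Python below renames whatever headers a statement uses to the three core fields and drops every other column. The core field names come from this thread; the synonym headers ("Track Code", "Net Revenue", "Territory"), the sample values and the `normalise` helper are assumptions made up for illustration.

```python
# Hypothetical mapping table: each header a customer might use -> core field.
MAPPING = {
    "isrc": "ISRC",
    "track code": "ISRC",
    "amount": "Amount",
    "net revenue": "Amount",
    "region": "Region",
    "territory": "Region",
}
CORE_FIELDS = ["ISRC", "Amount", "Region"]


def normalise(rows):
    """Rename known headers to their core names and drop all other columns."""
    result = []
    for row in rows:
        renamed = {}
        for header, value in row.items():
            # Compare headers case-insensitively and ignore stray whitespace.
            core = MAPPING.get(header.strip().lower())
            if core is not None:
                renamed[core] = value
        missing = [field for field in CORE_FIELDS if field not in renamed]
        if missing:
            # An unknown synonym has appeared: the mapping table needs updating.
            raise KeyError(f"no mapping found for core fields: {missing}")
        result.append({field: renamed[field] for field in CORE_FIELDS})
    return result


# A made-up statement row using one customer's own header names.
statement = [
    {"Track Code": "EXAMPLE000001", "Net Revenue": "0.70",
     "Territory": "GB", "Store": "Napster"},
]
clean = normalise(statement)
```

Raising an error for unmapped headers is a deliberate choice in this sketch: a new synonym then shows up as a prompt to extend the mapping table, and no rows disappear silently.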
Key point: the Mapping Table has to be maintained, because if another "business synonym" for one of your "core fields" enters the data, you'll need to map this new "synonym" to the appropriate "core" field name. This is like what you need to do if you're tracking retail product sales: as new products come to market, you need to segment them into categories and sub-categories so that aggregates by category and sub-category capture the sales of these new products.
Hope this helps, and please write back if you have any questions.
Best wishes,
Michael Martin
