
Normalizing cumulative data

kayman Member Posts: 662 Unicorn
edited December 2018 in Help

I have a vast amount of sales and repair data for a range of products, and I would like to compare repairs vs sales for comparable products. Since the sales ranges can differ a lot (from hundreds to millions), I would first have to scale them all to the same range, and preferably also shift them so the timeframes match. Not all products are sold at the same time, so at a certain moment I need to convert from a pure calendar / timescale range to something like 'first period of sales, second period, and so on' so I can compare over a 2-year scale.

 

Hope I make some sense here :-)

 

I've been playing around already but get stuck in the complexity of it. So what would be the best approach to tackle this? I'm not looking for ready-to-use code but more for advice on common-sense steps to take.
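
 

Just to illustrate the shape I'm after (not asking for code, this is only a rough pandas sketch with made-up column names): re-index each product to periods since its first sale, and scale each product's cumulative figures to a maximum of 1.

```python
import pandas as pd

# Toy cumulative data; column names are invented for illustration only
df = pd.DataFrame({
    "product":     ["A", "A", "A", "B", "B"],
    "month":       pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01",
                                   "2018-02-01", "2018-03-01"]),
    "cum_sales":   [1000, 2000, 3000, 100, 180],
    "cum_repairs": [0, 10, 15, 0, 1],
})

df = df.sort_values(["product", "month"])

# 1) shift to a relative timescale: period 1, 2, 3, ... since each product's first sale
df["period"] = df.groupby("product").cumcount() + 1

# 2) scale each product's cumulative curves so its maximum becomes 1
for col in ["cum_sales", "cum_repairs"]:
    df[col + "_scaled"] = df[col] / df.groupby("product")[col].transform("max")

print(df[["product", "period", "cum_sales_scaled", "cum_repairs_scaled"]])
```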

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hmm that's a good question @kayman. I guess my initial thought would be to use percentages, i.e. "The first 1% of products sold/repaired did this, the next 1% of products sold/repaired did this, etc..."

     

    Scott

     

  • kayman Member Posts: 662 Unicorn

    Sorry, I wasn't very clear about this, it's a bit hard to explain to be honest.

     

    My data is cumulative, so both my sales and repair figures gradually increase over time; sales will level off at a certain moment (nothing left to sell anymore, or a new product to keep the pockets empty) while repairs might go on a little longer.

     

    So the scenario is a bit as follows : 

     

    Product A - January - 1000 sales - 0 repairs

    Product A - February - 2000 sales (cumulative) - 10 repairs (0.5%)

    Product A - March - 3000 sales (cumulative) - 15 repairs (cumulative) (0.5%)

    ...

    And so on

     

    Same goes for product B, but sales may start a little later and the quantities are not as high.

     

    Now, if I keep the quantities as they are, the ratio might be the same, but for visual purposes I also want to be able to plot my sales on the same chart to see a better trend, so for both products the sales need to be scaled to a max of one (or whatever makes sense) so that products with low or high sales are both shown at the same height.

     

    The purpose is mainly visual interpretation, so I can show the trend lines for a given set of products and have the sales (Pareto-styled) and repairs in one chart so outliers or deviations are more visible.

     

    So the steps I had in mind were first to get my sales and repairs scaled so I don't have to be bothered anymore by the absolute figures, but I'm not sure what works best here. Next I would window them so the first period of sales gets aligned independently of start time, and so on.

     

    If I only had to worry about the actual ratio it would be easier, but I also need the actual sales trend included, since repairs will continue longer than sales and my ratio will naturally increase after end of sales. And the trends should all be equally scaled (the right term escapes me) so they nicely fit on a PowerPoint slide in the end...

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,528 RM Data Scientist

    @kayman,

     

    cumulative data is integrated data. Can't we just differentiate, normalize, and integrate again? That would result in a dataset where the maximum delta between two values is 1. (See the sketch at the end of this thread.)

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    ah I see. Ugh, I hate working with cumulative data (or any aggregated data in any form) for this reason and many, many others. Sure, what @mschmitz says makes sense....

     

    Scott

     

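For completeness, here is a minimal pandas sketch of the differentiate → normalize → integrate idea suggested above (the values are invented, and this is only one way to express it): the per-period increments are rescaled so the largest single-period jump equals 1, then re-accumulated into a cumulative curve.

```python
import pandas as pd

# One product's cumulative sales; the curve flattens out once sales end
cum_sales = pd.Series([1000, 2000, 3000, 3500, 3500])

deltas = cum_sales.diff().fillna(cum_sales.iloc[0])  # differentiate (keep the first value)
deltas_norm = deltas / deltas.max()                  # normalize: the largest delta becomes 1
cum_norm = deltas_norm.cumsum()                      # integrate again

print(cum_norm)  # a cumulative curve whose largest single-period jump is exactly 1
```

With the full dataset the same three steps would be applied per product (e.g. via a groupby), so every product's curve ends up on a comparable scale regardless of its absolute sales volume.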