Processing high volumes

peleitor Member Posts: 10 Contributor II
edited November 2018 in Help
Hello fellows.

We need to process a considerable volume of data, about 1 million retail ticket lines per day. Although this is a large volume, it may not really qualify as a 'big data' scenario.

Can anyone confirm or refute this assumption? And if this should be considered big data, what would be the recommended approach using RapidMiner?

Thanks

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Hi!

    There are different points to consider

    1. What is the actual data size? Smaller than 32 GB?
    2. What do you want to do with it? Aggregate? Or learn on 1 million examples?

    If the data set is smaller than your RAM, everything should be fine, as long as the actual number of examples is low enough for reasonable runtimes. Otherwise you can simply sample beforehand.
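To make the "smaller than your RAM" check and the "sample beforehand" step concrete, here is a minimal sketch. Only the 1 million lines per day and the 32 GB figure come from this thread; the 200 bytes per line and the 365 days of history are assumed numbers for illustration, and `estimate_size_gb` and `reservoir_sample` are hypothetical helper names. The sampling uses standard reservoir sampling, which keeps a uniform random sample from a stream without loading it all.

```python
import random


def estimate_size_gb(rows_per_day, bytes_per_row, days):
    """Rough raw data size in GiB; ignores any per-row overhead of the tool."""
    return rows_per_day * bytes_per_row * days / 1024**3


def reservoir_sample(rows, k, seed=42):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Assumed figures: 200 bytes per ticket line, one year of history.
size_gb = estimate_size_gb(1_000_000, 200, 365)
fits_in_ram = size_gb < 32  # about 68 GB under these assumptions, so False

# range() stands in for a cursor over the ticket lines.
sample = reservoir_sample(range(1_000_000), k=10_000)
```

With a real row width and retention period the result may look very different, so it is worth measuring both before deciding on infrastructure.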

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • peleitor Member Posts: 10 Contributor II
    Hello, thanks for your reply.

    1. We might take representative samples that could fit into 32 GB. The full data set is far larger than that.

    2. Aggregation could be solved directly in SQL, since this is a relational database. But for mining purposes (association detection such as market basket analysis, or predictive methods such as decision trees or linear/logistic regression) the data has to reach the mining software.
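As a rough illustration of the difference between the two cases, the sketch below uses an in-memory SQLite database as a stand-in for the production RDBMS. The table `ticket_lines` and its columns are hypothetical. The aggregate is computed inside the database, while market basket analysis needs the item list of every ticket, so the detail rows have to be read out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the production RDBMS
conn.execute(
    "CREATE TABLE ticket_lines "
    "(ticket_id INTEGER, product TEXT, qty INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO ticket_lines VALUES (?, ?, ?, ?)",
    [
        (1, "bread", 1, 1.5),
        (1, "milk", 2, 2.0),
        (2, "bread", 2, 3.0),
        (2, "butter", 1, 2.5),
        (3, "milk", 1, 1.0),
    ],
)

# Aggregation is pushed down to the database: one row per product comes back.
totals = conn.execute(
    "SELECT product, SUM(qty), SUM(amount) FROM ticket_lines "
    "GROUP BY product ORDER BY product"
).fetchall()

# Market basket analysis needs one item list per ticket, so every detail
# row is read and regrouped on the client side.
baskets = {}
for ticket_id, product in conn.execute(
    "SELECT ticket_id, product FROM ticket_lines ORDER BY ticket_id, product"
):
    baskets.setdefault(ticket_id, []).append(product)

conn.close()
```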

    The big question here is whether we would need some big data processing architecture (e.g. Hadoop-based) standing between the RDBMS and the mining software.

    Regards
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Hi,

    There are a few ways to handle this. Since the total data size is most likely larger than your RAM, you need a dedicated infrastructure.

    Way 1: Use a Hadoop cluster, sample your data, learn on the sampled data in memory, and apply the model in Hadoop.
    Way 2: Use a Hadoop cluster and learn directly in Hadoop. Radoop currently supports a number of algorithms (Decision Tree, Naive Bayes, Logistic Regression), and more are to come.
    Way 3: Use either a Hadoop cluster or some SQL DWH and work only on aggregates / representatives.
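The sample / learn / apply pattern of Way 1 can be sketched in plain Python. This is not Radoop or Hadoop code: a chunked loop stands in for the in-cluster apply step, a one-variable least-squares fit stands in for the learner, and `full_data`, `fit_line` and `chunks` are hypothetical names with synthetic data.

```python
import random


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def full_data(n):
    """Hypothetical stand-in for a table too large to load at once."""
    for x in range(n):
        yield x, 3.0 * x + 2.0


def chunks(rows, size):
    """Yield lists of at most `size` rows from an iterable."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


rng = random.Random(0)

# Step 1: sample (keep each row with a probability of 1%).
sample = [row for row in full_data(100_000) if rng.random() < 0.01]

# Step 2: learn on the sample, which fits in memory.
slope, intercept = fit_line(sample)

# Step 3: apply the model to the full data, one chunk at a time.
scored = 0
max_error = 0.0
for chunk in chunks(full_data(100_000), 10_000):
    for x, y in chunk:
        max_error = max(max_error, abs(slope * x + intercept - y))
        scored += 1
```

Only the sample and one chunk are held in memory at any time, which is the point of this approach when the full data exceeds RAM.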

    I think Way 3 might not be suited for you. Since this is about Radoop, I would ask you to contact our sales team (e.g. here: https://rapidminer.com/contact-sales-request-demo/ ). Then I or one of my colleagues can set up a Webex or similar to discuss it.

    Cheers,
    Martin
  • peleitor Member Posts: 10 Contributor II
    Thanks Martin!

    Do you think solving this via Hadoop/Radoop is typical in the retail industry? (E.g. one retail chain with 20 branches and 2M potential customers.)

    Regards
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Hi,

    Since I am a consultant in Germany, I can hardly speak about the non-German market. In my experience, more and more companies are shifting towards such an infrastructure. In Germany, however, that shift is still very much in progress. It is clear that using data is increasingly becoming a requirement rather than a nice-to-have.
    From what I have heard, U.S. companies are faster in the process of adopting it.

    Cheers,
    Martin
  • peleitor Member Posts: 10 Contributor II
    Thanks for the feedback!