ALL FEATURE REQUESTS HERE ARE MONITORED BY OUR PRODUCT TEAM.

VOTING MATTERS!

IDEAS WITH HIGH NUMBERS OF VOTES (USUALLY ≥ 10) ARE PRIORITIZED IN OUR ROADMAP.

NOTE: IF YOU WISH TO SUGGEST A NEW FEATURE, PLEASE POST A NEW QUESTION AND TAG AS "FEATURE REQUEST". THANK YOU.

Backup of operational RapidMiner Server

land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi everybody,
I have a question regarding backup and the scenario where I have an operational RapidMiner Server. Let's say we have a busy data science team and dozens of projects running in production. All these projects solve various predictive tasks that are used for process control in a huge manufacturing system. As huge manufacturing systems tend to be, this one is operated 24/7 to make the most out of the capital that is bound up in the machines.
The documentation states: "Simply switch it off to avoid inconsistent backups."
Okay. So I switch off my manufacturing plant, too? Entirely? While most of the employees might be on my side about taking a day off per week, I doubt management will follow me.

So what strategies are left? How can we externally synchronize the file system and the database? Or is there an alternative approach that I'm missing?

Looking forward to your ideas!

Greetings,
 Sebastian

PS: Yes, this is a feature request. Something like: the RM Repository should be transaction-safe AND have live backup capabilities.

2 votes

Open for Voting · Last Updated

IC-1434

Comments

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi Sebastian!

    This used to be easier when everything was in the database, as database dump tools can create consistent snapshots.

    On Linux there's LVM and on Windows you can do shadow copies for consistent file system backups. If you are able to start both backups (database and file system) at exactly the same time, you're mostly fine, as the time window for inconsistencies is very short, but not zero.
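
    A minimal sketch of how that simultaneous start could look, assuming a PostgreSQL backend and a repository on an LVM logical volume; all volume, database, and path names here are placeholders for illustration:

    ```python
    #!/usr/bin/env python3
    """Kick off an LVM snapshot and a database dump back to back.

    Illustrative sketch only: assumes a PostgreSQL backend and a repository
    on the LVM volume /dev/vg0/rmdata; adjust names and paths to your setup.
    """
    import subprocess

    # Freeze the repository file system first. Creating an LVM snapshot is
    # near-instantaneous, so the gap to the database dump stays tiny.
    subprocess.run(
        ["lvcreate", "--snapshot", "--size", "5G",
         "--name", "rm_backup_snap", "/dev/vg0/rmdata"],
        check=True,
    )

    # Start the database dump immediately afterwards.
    with open("/backup/rapidminer_db.sql", "w") as dump:
        subprocess.run(["pg_dump", "rapidminer"], stdout=dump, check=True)

    # Copy the frozen repository files out of the snapshot, then release it.
    subprocess.run(["mount", "-o", "ro", "/dev/vg0/rm_backup_snap",
                    "/mnt/rm_snap"], check=True)
    subprocess.run(["rsync", "-a", "/mnt/rm_snap/", "/backup/repository/"],
                   check=True)
    subprocess.run(["umount", "/mnt/rm_snap"], check=True)
    subprocess.run(["lvremove", "-f", "/dev/vg0/rm_backup_snap"], check=True)
    ```

    This would have to run as root on the server, and the snapshot only needs enough space to absorb the writes that happen while the copy runs.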

    However, you might be able to avoid writing to the repository during a maintenance window, e.g. 2:00-2:05 AM, and start the backups in this time range.
    Your modelling processes would then run outside of this window.
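
    As an illustration, a scheduled process could guard its repository writes with a small check like this (the window boundaries are just example values):

    ```python
    from datetime import datetime, time as dtime
    from time import sleep

    # Assumed maintenance window in which the backups run (example values).
    WINDOW_START = dtime(2, 0)   # 2:00 AM
    WINDOW_END = dtime(2, 5)     # 2:05 AM

    def in_maintenance_window(now=None):
        """Return True while the nightly backup window is open."""
        t = (now or datetime.now()).time()
        return WINDOW_START <= t < WINDOW_END

    # A process that writes to the repository simply waits the window out
    # before saving its results.
    while in_maintenance_window():
        sleep(30)
    ```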

    If you write example sets, logs and other stuff to the repository, you could instead put them into a database. 

    I still agree with you on this being a feature request. RM should make sure that every change is as "transactional" as possible. In the database this is easy; in the file system, long operations should write to temporary files, and the result should then be renamed to its final name, for example.
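
    That temp-file-plus-rename pattern looks roughly like this; a minimal sketch of the general technique, not RapidMiner's actual code:

    ```python
    import os
    import tempfile

    def atomic_write(path, data):
        """Write data so readers only ever see the old or the new file,
        never a half-written one."""
        # The temp file must live in the target directory so the final
        # rename stays within one file system, which makes it atomic.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # make sure the bytes are on disk
            os.replace(tmp_path, path)  # atomic rename
        except BaseException:
            os.unlink(tmp_path)
            raise
    ```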

    Regards,
    Balázs
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    edited February 2019
    pushed to Product Feedback and cc @jpuente
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Balázs, hi all,
    good point with the shadow copies; I already thought something like that would come up. But honestly, I think that is a bit beyond the average data scientist... and given that I rarely experience good support from IT for the data science team (on the contrary, IT usually seems suspicious of losing control), this is a problem. In cases of conflict with IT, this is a very bad problem for the data science team, because IT now has a lever to pull: no safe backup? That's against company principles...

    So I would really like to see a good solution to that. And an easy-to-use one this time...

    I don't think it is sufficient to put some effort into making both transactional-ish for backup (Heaven's sake, I hope it already IS transactional-ish, so that a power outage cannot crash the entire repository!). In a real scenario we have dozens of projects, and each of them might generate hundreds of models. And models and performance vectors cannot be written into the database (and, for performance reasons, shouldn't be). While losing a model might not be a problem (it could be recomputed), I fear the instabilities due to conflicting data between the metadata database and the disk...

    Any comments from the devs here? It would be great to have some insights to communicate to customers (and their IT departments).

    Greetings,
    Sebastian