Server 9.2: Writing in parallel Loops can cause a defective Repository structure

land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
Hi,
if you put a Store operator inside a parallelized Loop, directories and entries can end up duplicated with the same name. The problem occurs whenever the Store operator has to write into a repository directory that does not exist yet: depending on the timing, the directory is created TWICE or even more often (depending on the number of threads). You then have several directories with the same name, which leads to very strange effects when accessing the data...
It seems that the creation of entries is not synchronized, so two threads check for existence in parallel, both see that the entry does not exist, and then both create the same entry.
This problem limits parallelizability drastically, as we have to switch off parallel execution for all loops that write data!
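To illustrate what I mean, here is a minimal Java sketch of the suspected check-then-create race, using plain java.nio as a stand-in for the repository layer (the class and method names are made up). On a real file system the race surfaces as a FileAlreadyExistsException, while the Server repository apparently ends up with genuine duplicate entries because nothing enforces unique names:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class CheckThenCreateRace {

    // Unsafe: two threads can both see "missing" and both try to create the entry.
    static void storeUnsafe(Path folder, String entryName) throws IOException {
        if (!Files.exists(folder)) {         // check ...
            Files.createDirectory(folder);   // ... then act: the pair is not atomic
        }
        Files.write(folder.resolve(entryName), new byte[0]);
    }

    // Safe: createDirectories() is idempotent, so concurrent callers cannot
    // race each other into creating the folder twice.
    static void storeSafe(Path folder, String entryName) throws IOException {
        Files.createDirectories(folder);
        Files.write(folder.resolve(entryName), new byte[0]);
    }

    public static void main(String[] args) throws Exception {
        Path folder = Paths.get("demo-repo", "results");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            final int n = i;
            pool.submit(() -> {
                try {
                    storeUnsafe(folder, "result-" + n); // occasionally fails with FileAlreadyExistsException
                } catch (IOException e) {
                    System.err.println("Thread " + n + ": " + e);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```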
1 vote

Declined · Last Updated

Marking this as "Declined" because it will be part of a complete rebuild of the repository structure in RM 9.5. Please reopen if it is still an issue afterwards. RA-1386

Comments

  • mmichel Employee, Member Posts: 129 RM Engineering
    Hi Sebastian,

    thanks for the report. I've created an internal ticket and we will check for potential solutions.
    A potential workaround in the meantime might be to create the required folder structure before the parallelized Loop operator, so the Store operator only needs to persist the object itself (a rough sketch of the idea follows at the end of this comment).

    Hope this helps,
    Marcel
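    To sketch the pattern in plain Java (the paths are made up and this is of course not RapidMiner code): create the complete folder structure once, sequentially, before any parallel work starts, so the workers only ever write into folders that already exist.

    ```java
    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;
    import java.util.concurrent.*;

    public class PreCreateThenStore {
        public static void main(String[] args) throws Exception {
            Path base = Paths.get("demo-repo", "loop-output");
            List<String> groups = List.of("groupA", "groupB", "groupC");

            // 1) Sequential part: create the complete folder structure up front.
            for (String g : groups) {
                Files.createDirectories(base.resolve(g));
            }

            // 2) Parallel part: workers only write into folders that already exist,
            //    so no concurrent folder creation can happen at all.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (String g : groups) {
                pool.submit(() -> {
                    try {
                        Files.writeString(base.resolve(g).resolve("result.txt"), "done");
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }
    ```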
  • mwalocha Employee, Member Posts: 1 RM Engineering
    edited March 2019
    Hi Sebastian,
    I am currently working on this issue and have made two observations:

    1. If the folder structure does not exist, it sometimes happens that a "folder already exists" exception is thrown, but no duplicates appear.
    2. If the folder structure exists and I try to store the same file in the parallel loop, the file sometimes appears twice or even more often in the file tree in Studio.

    Regarding 1.): Did you really have duplicated folders (not files), and/or did you encounter the same exception?
    Regarding 2.): This seems to be a rendering problem within the file tree in Studio; the file exists physically only once. When I update the file tree with F5, the duplicates disappear. Would it be possible for you to refresh the file tree to see whether the duplicates really exist or whether it is just a rendering problem?

    It would also be good to know which operating system and setting you are using for parallel execution:
    Settings > Preferences > General > Parallel Execution

    Best,
    Marc
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Marc,
    for me it also created the folder twice. I'm fairly convinced that this happened directly on the Server, inside the JPA layer of the repository structure, although I did not check the underlying database. The entries were shown twice both in Studio and in the Server web view, and a refresh did not help at all. When I deleted the directory at the highest duplicated level, the other duplicate remained, but all data entries that had been inside the two directories were deleted, so the remaining directory was empty. Afterwards I could delete it as well.

    Anyway, all three error patterns we see here should be avoidable by synchronizing CRUD operations on the repository: whenever you create new children, rename or delete, you could probably lock the parent folder (see the sketch at the end of this post). If you are already doing so, there seems to be a problem with the implementation. If you don't synchronize the code somewhere, any arbitrary error can happen, depending on the actual timing of the parallel threads or parallel processes.

    Regarding the settings: I don't know what they were; the process was executed on the Server in some queue. The OS should be Linux, but I doubt it has anything to do with that.
    However, it might help you to know that this error was not introduced with the new repository implementation. It has been present since at least 7.1, so I believe it comes from the metadata/JPA part of the repository stored in the database.

    Greetings,
     Sebastian
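    To make the suggestion concrete, here is a minimal sketch of what "lock the parent folder" could look like inside one JVM. The class and the toy repository model are purely illustrative, not the actual Server code:

    ```java
    import java.util.*;
    import java.util.concurrent.*;
    import java.util.concurrent.locks.*;

    public class LockingRepository {

        // One lock object per parent folder path; computeIfAbsent is atomic on a ConcurrentHashMap.
        private final ConcurrentMap<String, ReentrantLock> folderLocks = new ConcurrentHashMap<>();
        // Toy model of the repository tree: folder path -> child entry names.
        private final ConcurrentMap<String, Set<String>> tree = new ConcurrentHashMap<>();

        public void createChild(String parentPath, String childName) {
            ReentrantLock lock = folderLocks.computeIfAbsent(parentPath, p -> new ReentrantLock());
            lock.lock();
            try {
                Set<String> children = tree.computeIfAbsent(parentPath, p -> new HashSet<>());
                // Existence check and insert now happen under the same lock, so two
                // threads can no longer both conclude "does not exist yet".
                if (!children.contains(childName)) {
                    children.add(childName);
                }
            } finally {
                lock.unlock();
            }
        }

        public static void main(String[] args) throws Exception {
            LockingRepository repo = new LockingRepository();
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                pool.submit(() -> repo.createChild("/results", "iteration-output"));
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            System.out.println(repo.tree.get("/results")); // always exactly one entry
        }
    }
    ```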
  • mmichel Employee, Member Posts: 129 RM Engineering

    Hi Sebastian,

    thanks for the detailed response!

    Sadly, we cannot provide a fix for this right now, as proper synchronization would require a major refactoring. In general, we are currently investigating other options to cover these kinds of use cases.

     

    Cheers,

    Marcel

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Marcel,
    can you give me some details about the reasons and your plans to resolve this? If you meet an unhappy customer who pays 60,000 € a year for a system that just lost their trust, because the repository containing the work artifacts of a project with half a million euro budget crumbled to pieces under regular operation, it's better to have a really good explanation at hand. "Too difficult" is, astonishingly, not a well-accepted explanation in such situations. Even more so if the customer is an IT-affine person: it's hard to argue when the simple Studio repository handles this without a problem, since every single commonly used file system is thread-safe.

    Greetings,
    Sebastian
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Stupid question: wouldn't a setting "do not create directory" in Store work? That would demand that you create the structure upfront and would raise a user error in case of a non-existing directory (a rough sketch of what I mean is at the end of this comment).

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
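    A minimal sketch of the idea, assuming a hypothetical fail-fast store method (this is not the actual Store operator code, just the behavior such a setting would enforce):

    ```java
    import java.io.IOException;
    import java.nio.file.*;

    public class FailFastStore {

        // Stores data into an existing folder only. If the folder is missing, a clear
        // error is raised instead of creating it on the fly, which keeps all folder
        // creation out of the parallel section entirely.
        static void store(Path folder, String entryName, byte[] data) throws IOException {
            if (!Files.isDirectory(folder)) {
                throw new IOException("Target folder does not exist: " + folder
                        + " (create the repository structure before the loop)");
            }
            Files.write(folder.resolve(entryName), data);
        }

        public static void main(String[] args) throws IOException {
            Path folder = Paths.get("demo-repo", "results");
            Files.createDirectories(folder);             // done once, before any parallel work
            store(folder, "output.dat", new byte[] {1, 2, 3});
        }
    }
    ```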
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Well... yes, if the customer thinks about checking it AND does not create the same entry in parallel (the latter obviously doesn't make sense anyway).
    But the idea is good, and I think it can even be improved: I will simply build myself a synchronized Store operator. If every single Store operator within one JVM instance is synchronized, it is very improbable that this will happen, as different processes will write into different folders anyway (a rough sketch of the pattern is at the end of this comment).
    How often does development ask you for ideas? Maybe it would be worthwhile.

    We will implement that as a quirky workaround in our next Jackhammer version, for everybody who has the same problem until the RapidMiner Devs come up with a real solution.
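    A minimal sketch of the pattern behind such a synchronized Store, assuming a single JVM-wide lock (this is the idea only, not the actual Jackhammer implementation):

    ```java
    import java.io.IOException;
    import java.nio.file.*;

    public class SynchronizedStore {

        // One lock shared by every store call in this JVM: all repository writes
        // from parallel loop iterations are serialized through it.
        private static final Object GLOBAL_STORE_LOCK = new Object();

        static void store(Path folder, String entryName, byte[] data) throws IOException {
            synchronized (GLOBAL_STORE_LOCK) {
                Files.createDirectories(folder);         // only one thread creates folders at a time
                Files.write(folder.resolve(entryName), data);
            }
        }
    }
    ```

    This of course serializes all repository writes inside one JVM, which is exactly the trade-off: a bit of lost parallelism in exchange for no concurrent folder creation.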

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    As you know, it's usually harder not to get any feedback from me than the other way around :) So literally every week.


    Best,
    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mmichel Employee, Member Posts: 129 RM Engineering
    Hi Sebastian,

    as we also want to support the high-availability setup of RapidMiner Server, the synchronization/locking of folders or entries would need to be performed outside of the JVM, e.g. within the shared DB (a rough sketch of what that could look like is at the end of this comment). Locking frequently used entries would decrease the performance of the repository operations used by the external executors. As we are making some changes to the repository anyway, we will incorporate new techniques that should solve the concurrency issue without decreasing performance. But this kind of change cannot be delivered with a patch release and requires a new major version.

    Cheers,
    Marcel
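    For what it's worth, a rough sketch of what cross-JVM locking via the shared database could look like. The table and column names are invented for illustration (this is not the Server schema), and the exact semantics of SELECT ... FOR UPDATE depend on the database:

    ```java
    import java.sql.*;

    public class DbLockedFolderCreate {

        // Locks the parent folder row in the shared DB so that concurrent executors
        // (even in different JVMs) serialize their child creation on it.
        static void createChild(Connection con, long parentId, String childName) throws SQLException {
            boolean oldAutoCommit = con.getAutoCommit();
            con.setAutoCommit(false);
            try {
                // Lock the parent folder row until the transaction ends.
                try (PreparedStatement lock = con.prepareStatement(
                        "SELECT id FROM repo_folder WHERE id = ? FOR UPDATE")) {
                    lock.setLong(1, parentId);
                    lock.executeQuery();
                }
                // Check-then-insert is now safe: no other transaction holds the parent lock.
                try (PreparedStatement check = con.prepareStatement(
                        "SELECT COUNT(*) FROM repo_folder WHERE parent_id = ? AND name = ?")) {
                    check.setLong(1, parentId);
                    check.setString(2, childName);
                    try (ResultSet rs = check.executeQuery()) {
                        rs.next();
                        if (rs.getLong(1) == 0) {
                            try (PreparedStatement insert = con.prepareStatement(
                                    "INSERT INTO repo_folder (parent_id, name) VALUES (?, ?)")) {
                                insert.setLong(1, parentId);
                                insert.setString(2, childName);
                                insert.executeUpdate();
                            }
                        }
                    }
                }
                con.commit();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            } finally {
                con.setAutoCommit(oldAutoCommit);
            }
        }
    }
    ```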
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Yip, I already solved it for 99.9% of the cases; that's enough until then.