job container forcibly killed

rur68 RapidMiner Certified Analyst, Member Posts: 11 Contributor II
I ran a modeling job on RapidMiner Server. After all the operators finished successfully and the results were stored correctly, the process ended in an error state:
"Job container '1' was killed forcefully and therefore the job execution has been stopped. Reason: Restart of job container has been invoked".
I didn't change the container restart policies, so they are still at their defaults.
Has anyone else encountered anything similar? Any suggestions on how to diagnose the issue?

Answers

  • jpuente Employee, Member Posts: 53 RM Product Management
    Hi. Job containers are configured by default never to restart. If they do, it's typically because of some problem with the job. Is there any other problem in the log? Does this always happen with that job? 
  • rur68 RapidMiner Certified Analyst, Member Posts: 11 Contributor II
    Hi @jpuente, thank you for your response. I have actually already solved this.
    I got many "matrix is singular" warnings in the log, probably because of the data I'm trying to predict. I excluded the problematic part and then it ran well.
    But this error keeps occurring since I upgraded to RapidMiner Server 9.6. Some of my jobs that were fine in the previous version now end in an error state like this. I don't know what's going on. Is version 9.6 more sensitive to warnings like this?

  • jpuente Employee, Member Posts: 53 RM Product Management
    Hi. There was no change that should have altered the behaviour that way. We could try to dig a bit deeper if you send the agent config file and the full log.
  • rur68 RapidMiner Certified Analyst, Member Posts: 11 Contributor II
    Hi @jpuente, here are the agent config file and the log.
    FYI, the previous version I used was 9.0. This is also not the only job whose job container gets killed: I have another job that always ends in an error state like this, even though its result is stored successfully.
  • jpuente Employee, Member Posts: 53 RM Product Management
    It looks like the Job Container (JC) becomes unresponsive right after completing the job. I'll share this internally and see what we can find.
  • rur68 RapidMiner Certified Analyst, Member Posts: 11 Contributor II
    Hi, thank you for your answer.
    1. Unfortunately it's not possible to try 9.7.1 Server/AI Hub right now. But what is the fundamental change in architecture in these versions?
    2. I don't think we have a network problem, because other jobs run well.
    3. It's also not the memory; I already increased it.
    4. This is the only option I could try. I already did, and it works. But I'm still confused about why the job ran well in version 9.0 while 9.6 produces errors like this.
    Anyway, thank you very much, @aschaferdiek.

  • aschaferdiek Employee, Member Posts: 76 RM Engineering
    There's no fundamental change in architecture from 9.6 to 9.7.1, but it's always a good idea to have the latest version running. :) From 9.0 to 9.6 there is a fundamental change: Job Agent and Job Containers now communicate internally via HTTP/REST. In 9.0 there was no such communication, so this error couldn't pop up because there was no link between them at all (the Job Container was just started as a separate and entirely standalone OS process).
    Glad that changing the properties helped. Because it helped, it's still very likely that this is some weird networking/machine problem. The timeout message still suggests that. I know we cannot be sure, but there's no other reason why a simple HTTP request would time out on localhost.
    Thank you for taking the time to try this out together with me. We'll consider increasing the defaults here.
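One way to test the "HTTP request times out on localhost" theory independently of RapidMiner is to probe the local port repeatedly while the job runs and count how often a plain TCP connect fails or times out. The following is a minimal sketch under assumptions: the port the Job Container listens on is not given in this thread, so it must be supplied by the reader, and the function names `probe` and `count_failures` are hypothetical helpers, not part of RapidMiner.

```python
import socket
import time


def probe(host, port, timeout_s=2.0):
    """Return the TCP connect time in seconds, or None on refusal or timeout."""
    start = time.monotonic()
    try:
        # create_connection raises OSError subclasses for both
        # "connection refused" and timeouts.
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def count_failures(host, port, attempts=5, pause_s=1.0, timeout_s=2.0):
    """Probe `attempts` times and return how many probes failed."""
    failures = 0
    for i in range(attempts):
        if probe(host, port, timeout_s) is None:
            failures += 1
        if i < attempts - 1:
            time.sleep(pause_s)
    return failures
```

Calling `count_failures("127.0.0.1", port)` during the model-building phase, with `port` set to whatever the Job Container actually uses on that machine, would show whether connects start failing only when the machine is under load. This checks reachability at the TCP level only; it does not reproduce the agent's own health check.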
  • rur68 RapidMiner Certified Analyst, Member Posts: 11 Contributor II
    Hi @aschaferdiek.
    I already added the properties below to the agent.properties file, and that worked before.
    Now the same job doesn't succeed at all because of a refused connection while building the model using deep learning. The process ends with the error state "Job container '1' was killed forcefully and therefore the job execution has been stopped. Reason: Restart of job container has been invoked". I've attached the log file. Any suggestions on how to handle this refused connection in the middle of the process?
    # amount of errors tolerated before shutdown
    jobagent.container.maxErrorAmountBeforeSpawn = 10

    # time between errors in milliseconds
    jobagent.container.maxTimeBetweenErrors = 10000
  • aschaferdiek Employee, Member Posts: 76 RM Engineering
    Hi @rur68. The main problem seems to be the same. The container gets killed because it has been unreachable for a certain number of retries.
    Do you see the Job Container Java process running in the operating system process list during RapidMiner execution? How about its resource consumption?
    To me it still looks like either resource overload on the Job Container machine or a network problem. Could you monitor resources during execution? You could also try setting up a Job Agent on another machine with more resources.
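The two checks suggested above (is the Java process still in the process list, and what is it consuming) can be scripted so they run alongside the job. This is a rough sketch, assuming a Linux or macOS host where `ps -A -o pid,pcpu,rss,args` is available; matching on the substring "java" is an assumption and will also pick up the Job Agent and any other JVMs, and `parse_ps` and `sample` are hypothetical helper names.

```python
import subprocess


def parse_ps(output, needle="java"):
    """Parse `ps -A -o pid,pcpu,rss,args` output into one dict per matching process."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, cpu, rss_kb, args = parts
        if needle in args:
            rows.append({
                "pid": int(pid),
                "cpu_percent": float(cpu),
                "rss_mb": int(rss_kb) / 1024,  # ps reports RSS in kilobytes
                "args": args,
            })
    return rows


def sample(needle="java"):
    """Take one snapshot of matching processes from the live process list."""
    out = subprocess.run(
        ["ps", "-A", "-o", "pid,pcpu,rss,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out, needle)
```

Calling `sample()` every few seconds while the process runs and logging the result gives a timeline: if the Job Container entry disappears, or its CPU and memory climb sharply just before the "killed forcefully" message, that supports the resource-overload explanation.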